Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-10-08 Thread Mike Beckerle
Never mind, I figured this out. It was due to 'properties' being a reserved
keyword. I created a PR to fix the JSON doc on the Drill site.


Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-10-07 Thread Mike Beckerle
Ok, after weeks of delay

That helps a great deal. You flatten the array of maps into a table of maps.

I am still confused about when I must use square-bracket versus dot
notation: data['a'] vs. data.a
The JSON documentation for Drill uses dot notation to reach into fields of
a map.

Ex: from the JSON doc:

{
  "type": "FeatureCollection",
  "features": [
  {
"type": "Feature",
"properties":
{
  "MAPBLKLOT": "0001001",
  "BLKLOT": "0001001",
  "BLOCK_NUM": "0001",
  "LOT_NUM": "001",
   

The query uses SELECT features[0].properties.MAPBLKLOT FROM ...,
which is dot notation, whereas in your queries on my JSON you did not
use dot notation.

I tried revising the queries you wrote to use dot notation, and they were
rejected with "no table named 'data'", but I'm not sure why.

Ex:

This works (your original working query):

SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;

But this fails:

SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
Table 'data' not found

But your sub-select defines 'data' as, I would assume, a table.

Can you help me clarify this?

[Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
apache drill>
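A note on the failure above: Drill's SQL parser reads an unqualified `data.b.b1` as table `data`, column `b`, which matches the "Table 'data' not found" error. Qualifying the path with the subquery's table alias usually resolves this; the following is an untested sketch (the alias `t` is illustrative, not from the thread):

```sql
-- Qualify the flattened map with the subquery alias `t`, so that
-- t.data.b.b1 parses as a column path rather than as table `data`.
SELECT t.data.a AS a, t.data.b AS b
FROM (SELECT FLATTEN(record) AS data
      FROM dfs.`/tmp/record.json`) AS t
WHERE t.data.b.b1 > 60.0;
```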



Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-18 Thread Charles Givre
Hi Mike, 
Let me answer your question with some queries:

 >>> select * from dfs.test.`record.json`;
+--+
| record |
+--+
| [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}] |
+--+

Now... I can flatten that like this:

>>> select flatten(record) AS data from dfs.test.`record.json`;
+--+
| data |
+--+
| {"a":{"a1":5.0,"a2":6.0},"b":{}} |
| {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
| {"a":{"a1":7.0,"a2":8.0},"b":{}} |
| {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
+--+
4 rows selected (0.298 seconds)

You asked about filtering. For this, I broke it up into a subquery;
here's how I did that:

>>> SELECT data['a'], data['b']
2..semicolon> FROM (select flatten(record) AS data from dfs.test.`record.json`)
3..semicolon> WHERE data['b']['b1'] > 60.0;
++-+
| EXPR$0 | EXPR$1  |
++-+
| {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
++-+
1 row selected (0.379 seconds)
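A small variation of the query above (an untested sketch) adds column aliases so the results come back named instead of as EXPR$0 / EXPR$1:

```sql
-- Same flatten-and-filter query, with aliases on the projected maps.
SELECT data['a'] AS a, data['b'] AS b
FROM (SELECT FLATTEN(record) AS data
      FROM dfs.test.`record.json`)
WHERE data['b']['b1'] > 60.0;
```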

I did all this without the union data type.  

Does this make sense?
Best,
-- C


Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-13 Thread Paul Rogers
Hi Mike,

Looks like you are wrestling with two separate issues. The first is how to
read the encoded data that you showed. In Drill, each data format generally
needs its own reader. Drill's reader operator provides all the plumbing
needed to handle multiple format readers, pack data into vectors, handle
projection and all the rest. But, to actually parse a stream of bytes into
data values, Drill needs a reader (AKA format plugin).

If you were to write such a reader for your encoded data format, then each
parse of a data value would write that value to a vector, potentially
creating the vector as needed (and back filling null values). All this is
handled automagically by the enhanced vector framework (EVF).

Now, you *could* do something that reads the encoded records, emits JSON,
and lets the JSON parser parse it again. I suspect you'd find that doing so
is a) much slower, and b) more work than just creating the required reader.
Either way, you'll need code for each format that Daffodil supports but
which Drill does not yet support.

One quick answer: avoid the UNION type if you can. It works, but barely. It
is slow, inefficient, not supported in many operators, and is unknown to
client libraries. Since Daffodil is all about schemas, use the schema to
figure out the correct type. If Daffodil allows UNION types, only then
would it make sense to map them to Drill's UNION type, and deal with the
many limitations.

Finally, on to your query. Let's reference the JSON form. Let's assume that
if you had a COBOL parser, it would end up with the same vector structure
as the JSON parser would produce: different ways to parse data, but same
internal data structures. To do the query you want, you'd have to:

* Flatten the content of `record` in each record. That is, JSON would read
the above as a set of records, each of which has one field called `record`
which is an array of maps (i.e. a repeated map.) Flattening produces a
stream of maps.
* Then, you'd project `a` and `b` to the top level, giving you two
top-level fields called `a` and `b`. Alternatively, project `a.a1`, `a.a2`,
`b.b1` and `b.b2` to the top level. The values for the "missing" map will
be SQL NULL.
* Finally, state your query as usual: SELECT b1, b2 FROM (...nested unpack
queries here...) WHERE b1 > 10.
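Put together, the steps above might look like the following untested sketch, assuming the JSON sits at dfs.`/tmp/record.json` (the aliases `t` and `data` are illustrative, not from the thread):

```sql
-- 1. FLATTEN turns the `record` array into one row per map.
-- 2. Project b's members to the top level (NULL where `b` is empty).
-- 3. Filter on b1.
SELECT t.data.b.b1 AS b1, t.data.b.b2 AS b2
FROM (SELECT FLATTEN(record) AS data
      FROM dfs.`/tmp/record.json`) AS t
WHERE t.data.b.b1 > 10;
```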

As noted in a previous response, SQL (and Drill) doesn't have the
expressiveness to do complex queries while leaving data in its original
structured form. That said, this query might actually work:

SELECT `record` FROM ... WHERE `record`.`b`.`b1` > 10

I *think* this will return a set of records, with a `record` array, a `b`
map, and `b1` and `b2` members that satisfies the query. This works because
the '`record`.`b`.`b1` > 10' expression will be FALSE if b1 (or b) is NULL.
Still, for the sake of clients, you'd want to flatten the results to the
top level.

Charles is really the query expert, he might have other tricks that he's
found that work better.

Drill is smart enough to push projection down into the reader: that's one
of the fancy bits that EVF handles. EVF will notice that we only want
`record`.`b`.`b1` and `b2` and won't project the map `a` or any of its
contents. When the reader provides those values, they will simply go into
the bit bucket. (Caveat: there are some limitations on this feature: I have
some long-delayed fixes that you might need.)

I hope this helps.

Thanks,

- Paul


Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-13 Thread Mike Beckerle
I'm thinking about whether a first prototype of DFDL integration for Drill
should just use JSON.

But please consider this JSON:

{ "record": [
{ "a": { "a1":5, "a2":6 } },
{ "b": { "b1":55, "b2":66, "b3":77 } }
{ "a": { "a1":7, "a2":8 } },
{ "b": { "b1":77, "b2":88, "b3":99 } }
  ] }

It corresponds to this text data file, parsed using Daffodil:

105062556677107082778899

The file is a stream of records. The first byte of each record is a tag:
value 1 for type 'a' records, and 2 for type 'b' records.
The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
a2. They are integers.
The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
and b3. They are integers.
This kind of format is very common, even textualized like this (from COBOL
programs, for example).

Can Drill query the JSON above to get (b1, b2) where b1 > 10?
(And... does this require the experimental Union feature?)

b1, b2
-
(55, 66)
(77, 88)

I ask because in an XML Schema or DFDL schema choices with dozens of
'branches' are very common.
Ex: schema for the above data:

<xs:element name="record">
   <xs:complexType>
      <xs:choice maxOccurs="unbounded">
         <xs:element name="a">
            <xs:complexType>
               <xs:sequence>
                  ... many child elements let's say named a1, a2, ...
               </xs:sequence>
            </xs:complexType>
         </xs:element>
         <xs:element name="b">
            <xs:complexType>
               <xs:sequence>
                  ... many child elements let's say named b1, b2, b3
               </xs:sequence>
            </xs:complexType>
         </xs:element>
         ...
      </xs:choice>
   </xs:complexType>
</xs:element>
To me, XSD choice naturally requires a Union feature of some sort.
If that's still experimental in Drill ... what to do?

On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre wrote:

> @mbeckerle 
> You've encountered another challenge that exists in Drill reading data
> without a schema.
> Let me explain a bit about this; I'm going to use the JSON reader as an
> example. First, Drill requires data to be homogeneous. Drill does have a
> Union vector type, which allows heterogeneous data; however, it is a bit
> experimental and I wouldn't recommend using it. Also, it really just shifts
> schema inconsistencies to the user.
>
> For instance, let's say you have a column consisting of strings and
> floats. What happens if you try to do something like this:
>
> SELECT sum(mixed_col)
> -- or
> SELECT ORDER BY mixed_col
>
> Remembering that Drill is distributed and if you have a column with the
> same name and you try to do these operations, they will fail.
>
> Let's say we have data like this:
>
> [
>   {
>  'col1': 'Hi there',
>  'col2': 5.0
>   },
>   {
>  'col1':True,
>  'col2': 4,
>  'col3': 'foo'
>   }
> ]
>
> In older versions of Drill, this kind of data would throw all kinds
> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
> submitted apache#2638, which overhauled implicit
> casting. What this meant for users is that col2 in the above would be
> automatically cast to a FLOAT and col1 would be automatically cast to a
> VARCHAR.
>
> However, when reading data the story is a little different. What we did
> for the JSON reader was to provide several read modes. The least tolerant
> attempts to infer all data types. This seems like a great idea in
> principle; however, when you start actually using Drill with real data,
> you start seeing the issues with this approach. The JSON reader has a few
> configuration options that increase its tolerance for bad data. The next
> level is readAllNumbersAsDouble, which, as the name implies, reads all
> numeric data as Doubles and does not attempt to infer ints vs. floats. The
> next option is allTextMode, which reads all fields as VARCHAR. This
> should be used when the data is so inconsistent that it cannot be read in
> either other mode. These modes can be set globally, at the plugin level,
> or at query time.
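For reference, these read modes correspond to session options documented for Drill's JSON reader; a hedged sketch of setting them at query time:

```sql
-- Read all JSON numbers as DOUBLE (skip int-vs-float inference):
ALTER SESSION SET `store.json.read_numbers_as_double` = true;

-- Read every JSON field as VARCHAR (all-text mode):
ALTER SESSION SET `store.json.all_text_mode` = true;
```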
>
> For the XML reader, I didn't add type inference because I figured the data
> would be quite messy; however, it wouldn't be that hard to add basically
> the same levels as the JSON reader has.
>
> This fundamental issue exists in all the readers that read data without a
> schema. My rationale for working on the XSD reader is that this will enable
> us to accurately read XML data with all the correct data types.
>