Re: Drill SQL questions - JSON context

2023-08-18 Thread Paul Rogers
Hi Mike,

Good progress! There are a number of factors to consider. Let's work
through them one by one.

First, try the simplest possible query:

SELECT * FROM 

If you are using the row set mechanism, grab the schema and print it. (My
memory is hazy, but I do believe that there are methods and classes that
will do this for you.) What you should see is the nested structure you
created. The JSON reader has a super-complex parser that will work out
structure and types based on the first value seen. In your example, it
should guess VARCHAR and INT for your data items.

Once you confirm that the JSON parser has correctly interpreted your data,
you can move onto the second question: how SQL works with structured data.
Here we have to realize that SQL wasn't designed for structured data: SQL
only knows how to work with variables projected to the top level. This
leads to a quantum-like result that observing a variable changes its
structure. The JSON reader uses something called the Enhanced Vector
Framework (EVF) to do the projecting at scan time. Let's work out what it
should be doing. (The code is gawd-awful complex, so there is a chance that
something might be broken.)

In your query, you are projecting c.f.g to a top level variable g. Fields c
and f are an array of maps. I can't recall testing this kind of projection,
but I'd expect it to result in the projected variable g being an array or
arrays: what Drill calls a repeated list. Although I wrote this stuff, I
don't recall any code that will convert a repeated map into a repeated
list: so this area may be a bit tender. Or, maybe it just punts and leaves
the repeated map, but with a single entry? That wouldn't quite work. This
could use a bit of testing.

The third question is how to flatten rows. Flattening occurs via a separate
flatten operator. You'd need to flatten twice: once for each level. This
whole area is a bit hazy for me (I'm not super familiar with the details),
but I suspect you'd need to use a set of nested SELECT statements, each of
which flattens the outermost level, which will project the result to the
top level where it can be manipulated by the SELECT at the next outer
level. To try this, extend your SELECT * to select just a top-level field
(a) and flatten a top-level repeated map (f). The result should be rows
with a scalar and a repeated map. Then, add another level of SELECT to
flatten the repeated map: you'll get a scalar and a map. Then, use yet
another SELECT to pick out the map fields to top-level fields, and do the
WHERE clause. I *think* that should more-or-less work.

- Paul

On Fri, Aug 18, 2023 at 2:01 PM Mike Beckerle  wrote:

> I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
> moment, until we build a tighter integration. This is my conceptual test
> framework for that integration.)
>
> I have parsed data to create this JSON which represents 2-level nested
> repeating subrecords.
>
> All the simple fields are int.
>
> [{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
>  {"g":7,  "h":8 }]},
>{"d":9,  "e":10, "f":[{"g":11, "h":12},
>  {"g":13, "h":14}]}]},
>  {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
>  {"g":27, "h":28 }]},
>{"d":29, "e":30, "f":[{"g":31, "h":32},
>  {"g":33, "h":34}]}]}]
>
> So, the top level is a vector of maps,
> within that, field "c" is a vector of maps,
> and within "c" is a field f which is a vector of maps.
>
> The reason I created this is I'm trying to understand the arrays and how
> they work with Drill SQL.
>
> I'm trying to figure out how to get this rowset of 3 rows from a query, and
> I'm stumped.
>
>   a   b   d   e   g   h
> ( 1,  2,  3,  4,  5,  6)
> ( 1,  2,  9, 10, 13, 14)
> (21, 22, 29, 30, 33, 34)
>
> This is the SQL that is my conceptual framework, but I'm sure it won't
> work.
>
> SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
> FROM ... the json file...
> WHERE g mod 10 == 3 OR g == 5
>
> But I know it's not going to be that easy to get the query to traverse the
> vector inside the vector.
>
> From the doc, the FLATTEN operator seems to be needed, but I can't really
> figure it out.
>
> This is what all my data is like. Trees of nested vectors of sub-records.
>
> Can anyone advise on what the SQL might look like, or where there's an
> example doing something like this I can learn from?
>
> Thanks for any help
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>


Drill SQL questions - JSON context

2023-08-18 Thread Mike Beckerle
I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
moment, until we build a tighter integration. This is my conceptual test
framework for that integration.)

I have parsed data to create this JSON which represents 2-level nested
repeating subrecords.

All the simple fields are int.

[{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
 {"g":7,  "h":8 }]},
   {"d":9,  "e":10, "f":[{"g":11, "h":12},
 {"g":13, "h":14}]}]},
 {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
 {"g":27, "h":28 }]},
   {"d":29, "e":30, "f":[{"g":31, "h":32},
 {"g":33, "h":34}]}]}]

So, the top level is a vector of maps,
within that, field "c" is a vector of maps,
and within "c" is a field f which is a vector of maps.

The reason I created this is I'm trying to understand the arrays and how
they work with Drill SQL.

I'm trying to figure out how to get this rowset of 3 rows from a query, and
I'm stumped.

  a   b   d   e   g   h
( 1,  2,  3,  4,  5,  6)
( 1,  2,  9, 10, 13, 14)
(21, 22, 29, 30, 33, 34)

This is the SQL that is my conceptual framework, but I'm sure it won't work.

SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
FROM ... the json file...
WHERE g mod 10 == 3 OR g == 5

But I know it's not going to be that easy to get the query to traverse the
vector inside the vector.

>From the doc, the FLATTEN operator seems to be needed, but I can't really
figure it out.

This is what all my data is like. Trees of nested vectors of sub-records.

Can anyone advise on what the SQL might look like, or where there's an
example doing something like this I can learn from?

Thanks for any help

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-18 Thread via GitHub


mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1298815764


##
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpXmlOptions.java:
##
@@ -111,7 +111,7 @@ public String toString() {
   public static class HttpXmlOptionsBuilder {
 
 private int dataLevel;
-private boolean allTextMode;
+private Boolean allTextMode;

Review Comment:
   I thought there were 3 modes: allTextMode, allNumbersAreDouble mode, and 
infer-types mode. 
   
   So why is this a boolean vs am enum?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-18 Thread Charles Givre
Hey Mike,
So it looks like I was wrong and the XML reader does not have the support for 
Arrays.  However... Once DRILL-8450 is merged, I'll add the readers for arrays. 
  The XML reader itself still won't be able to dynamically detect them until we 
finish the XSD support, but at least the infra will be there.
Best,
-- C


> On Aug 15, 2023, at 11:39 PM, Charles Givre  wrote:
> 
> I stand corrected...  It does not look like the XML reader has any support 
> for arrays.
> -- C
> 
>> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
>> 
>> IIRC, the syntax for the "provided schema" for arrays is "ARRAY" such
>> as "ARRAY". This works, however, only if the XML reader uses the
>> (very complex) EVF framework and has a way to control parsing based on the
>> data type (and to set the data type based on parsing). The JSON reader has
>> such an integration. Charles, did you do the work to add that kind of
>> dynamic state machine to the XML parser?
>> 
>> - Paul
>> 
>> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
>> 
>>> Hi Mike,
>>> It is theoretically possible but I don't have an example of the syntax.
>>> As you've probably figured out, Drill vectors have both a type and data
>>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>>> Thus, you could tell Drill via the inline schema that the data mode for a
>>> given field is REPEATED and that would be the Drill equivalent of an
>>> Array.  I've never actually done this, so I don't really know if it would
>>> work for inline schemata but I'd assume that it would.
>>> 
>>> I'll do some digging to see whether I have any examples of this.
>>> Best,
>>> --C
>>> 
>>> 
>>> 
>>> 
>>> 
 On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
 
 I'm trying to get my Drill SQL queries to produce the right thing from
>>> XML.
 
 A major thing that you can't easily infer from looking at just XML data
>>> is
 what is an array. XML lacks an array starting indicator.
 
 Is there an inline schema notation in the Drill Query language for
 array-ness, so that one can inform Drill what is an array?
 
 For example this provides simple types for all the fields directly in the
 query.
 
 @Test
 
 public void testSimpleProvidedSchema() throws Exception {
 
 String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
 (type => 'xml', schema " +
 
  "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
 FLOAT, `double_field` DOUBLE, `boolean_field` " +
 
  "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
 TIMESTAMP, `string_field`" +
 
  " VARCHAR, `date2_field` DATE properties {`drill.format` =
 `MM/dd/`})'))";
 
 RowSet results = client.queryBuilder().sql(sql).rowSet();
 
 assertEquals(2, results.rowCount());
 
 
 Can one also tell Drill what fields or child elements are arrays?
>>> 
>>> 
> 



signature.asc
Description: Message signed with OpenPGP


Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-18 Thread via GitHub


cgivre commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1684011222

   @mbeckerle Could you please take another look.  I had to fix a few things 
for a unit test.  Thx!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org