Re: Drill SQL questions - JSON context
Hi Mike, Good progress! There are a number of factors to consider. Let's work through them one by one. First, try the simplest possible query: SELECT * FROM If you are using the row set mechanism, grab the schema and print it. (My memory is hazy, but I do believe that there are methods and classes that will do this for you.) What you should see is the nested structure you created. The JSON reader has a super-complex parser that will work out structure and types based on the first value seen. In your example, it should guess VARCHAR and INT for your data items. Once you confirm that the JSON parser has correctly interpreted your data, you can move onto the second question: how SQL works with structured data. Here we have to realize that SQL wasn't designed for structured data: SQL only knows how to work with variables projected to the top level. This leads to a quantum-like result that observing a variable changes its structure. The JSON reader uses something called the Enhanced Vector Framework (EVF) to do the projecting at scan time. Let's work out what it should be doing. (The code is gawd-awful complex, so there is a chance that something might be broken.) In your query, you are projecting c.f.g to a top level variable g. Fields c and f are an array of maps. I can't recall testing this kind of projection, but I'd expect it to result in the projected variable g being an array or arrays: what Drill calls a repeated list. Although I wrote this stuff, I don't recall any code that will convert a repeated map into a repeated list: so this area may be a bit tender. Or, maybe it just punts and leaves the repeated map, but with a single entry? That wouldn't quite work. This could use a bit of testing. The third question is how to flatten rows. Flattening occurs via a separate flatten operator. You'd need to flatten twice: once for each level. This whole area is a bit hazy for me (I'm not super familiar with the details), but I suspect you'd need to use a set of nested SELECT statements, each of which flattens the outermost level, which will project the result to the top level where it can be manipulated by the SELECT at the next outer level. To try this, extend your SELECT * to select just a top-level field (a) and flatten a top-level repeated map (f). The result should be rows with a scalar and a repeated map. Then, add another level of SELECT to flatten the repeated map: you'll get a scalar and a map. Then, use yet another SELECT to pick out the map fields to top-level fields, and do the WHERE clause. I *think* that should more-or-less work. - Paul On Fri, Aug 18, 2023 at 2:01 PM Mike Beckerle wrote: > I'm using Apache Daffodil in the mode where it outputs JSON data. (For the > moment, until we build a tighter integration. This is my conceptual test > framework for that integration.) > > I have parsed data to create this JSON which represents 2-level nested > repeating subrecords. > > All the simple fields are int. > > [{"a":1, "b":2, "c":[{"d":3, "e":4, "f":[{"g":5, "h":6 }, > {"g":7, "h":8 }]}, >{"d":9, "e":10, "f":[{"g":11, "h":12}, > {"g":13, "h":14}]}]}, > {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 }, > {"g":27, "h":28 }]}, >{"d":29, "e":30, "f":[{"g":31, "h":32}, > {"g":33, "h":34}]}]}] > > So, the top level is a vector of maps, > within that, field "c" is a vector of maps, > and within "c" is a field f which is a vector of maps. > > The reason I created this is I'm trying to understand the arrays and how > they work with Drill SQL. > > I'm trying to figure out how to get this rowset of 3 rows from a query, and > I'm stumped. > > a b d e g h > ( 1, 2, 3, 4, 5, 6) > ( 1, 2, 9, 10, 13, 14) > (21, 22, 29, 30, 33, 34) > > This is the SQL that is my conceptual framework, but I'm sure it won't > work. > > SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h > FROM ... the json file... > WHERE g mod 10 == 3 OR g == 5 > > But I know it's not going to be that easy to get the query to traverse the > vector inside the vector. > > From the doc, the FLATTEN operator seems to be needed, but I can't really > figure it out. > > This is what all my data is like. Trees of nested vectors of sub-records. > > Can anyone advise on what the SQL might look like, or where there's an > example doing something like this I can learn from? > > Thanks for any help > > Mike Beckerle > Apache Daffodil PMC | daffodil.apache.org > OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl > Owl Cyber Defense | www.owlcyberdefense.com >
Drill SQL questions - JSON context
I'm using Apache Daffodil in the mode where it outputs JSON data. (For the moment, until we build a tighter integration. This is my conceptual test framework for that integration.) I have parsed data to create this JSON which represents 2-level nested repeating subrecords. All the simple fields are int. [{"a":1, "b":2, "c":[{"d":3, "e":4, "f":[{"g":5, "h":6 }, {"g":7, "h":8 }]}, {"d":9, "e":10, "f":[{"g":11, "h":12}, {"g":13, "h":14}]}]}, {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 }, {"g":27, "h":28 }]}, {"d":29, "e":30, "f":[{"g":31, "h":32}, {"g":33, "h":34}]}]}] So, the top level is a vector of maps, within that, field "c" is a vector of maps, and within "c" is a field f which is a vector of maps. The reason I created this is I'm trying to understand the arrays and how they work with Drill SQL. I'm trying to figure out how to get this rowset of 3 rows from a query, and I'm stumped. a b d e g h ( 1, 2, 3, 4, 5, 6) ( 1, 2, 9, 10, 13, 14) (21, 22, 29, 30, 33, 34) This is the SQL that is my conceptual framework, but I'm sure it won't work. SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h FROM ... the json file... WHERE g mod 10 == 3 OR g == 5 But I know it's not going to be that easy to get the query to traverse the vector inside the vector. >From the doc, the FLATTEN operator seems to be needed, but I can't really figure it out. This is what all my data is like. Trees of nested vectors of sub-records. Can anyone advise on what the SQL might look like, or where there's an example doing something like this I can learn from? Thanks for any help Mike Beckerle Apache Daffodil PMC | daffodil.apache.org OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl Owl Cyber Defense | www.owlcyberdefense.com
Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)
mbeckerle commented on code in PR #2819: URL: https://github.com/apache/drill/pull/2819#discussion_r1298815764 ## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpXmlOptions.java: ## @@ -111,7 +111,7 @@ public String toString() { public static class HttpXmlOptionsBuilder { private int dataLevel; -private boolean allTextMode; +private Boolean allTextMode; Review Comment: I thought there were 3 modes: allTextMode, allNumbersAreDouble mode, and infer-types mode. So why is this a boolean vs am enum? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: is there a way to provide inline array metadata to inform the xml_reader?
Hey Mike, So it looks like I was wrong and the XML reader does not have the support for Arrays. However... Once DRILL-8450 is merged, I'll add the readers for arrays. The XML reader itself still won't be able to dynamically detect them until we finish the XSD support, but at least the infra will be there. Best, -- C > On Aug 15, 2023, at 11:39 PM, Charles Givre wrote: > > I stand corrected... It does not look like the XML reader has any support > for arrays. > -- C > >> On Aug 15, 2023, at 12:01 AM, Paul Rogers wrote: >> >> IIRC, the syntax for the "provided schema" for arrays is "ARRAY" such >> as "ARRAY". This works, however, only if the XML reader uses the >> (very complex) EVF framework and has a way to control parsing based on the >> data type (and to set the data type based on parsing). The JSON reader has >> such an integration. Charles, did you do the work to add that kind of >> dynamic state machine to the XML parser? >> >> - Paul >> >> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre wrote: >> >>> Hi Mike, >>> It is theoretically possible but I don't have an example of the syntax. >>> As you've probably figured out, Drill vectors have both a type and data >>> mode. The mode is either NULLABLE or REPEATED if I remember correctly. >>> Thus, you could tell Drill via the inline schema that the data mode for a >>> given field is REPEATED and that would be the Drill equivalent of an >>> Array. I've never actually done this, so I don't really know if it would >>> work for inline schemata but I'd assume that it would. >>> >>> I'll do some digging to see whether I have any examples of this. >>> Best, >>> --C >>> >>> >>> >>> >>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle wrote: I'm trying to get my Drill SQL queries to produce the right thing from >>> XML. A major thing that you can't easily infer from looking at just XML data >>> is what is an array. XML lacks an array starting indicator. Is there an inline schema notation in the Drill Query language for array-ness, so that one can inform Drill what is an array? For example this provides simple types for all the fields directly in the query. @Test public void testSimpleProvidedSchema() throws Exception { String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml` (type => 'xml', schema " + "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field` FLOAT, `double_field` DOUBLE, `boolean_field` " + "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field` TIMESTAMP, `string_field`" + " VARCHAR, `date2_field` DATE properties {`drill.format` = `MM/dd/`})'))"; RowSet results = client.queryBuilder().sql(sql).rowSet(); assertEquals(2, results.rowCount()); Can one also tell Drill what fields or child elements are arrays? >>> >>> > signature.asc Description: Message signed with OpenPGP
Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)
cgivre commented on PR #2819: URL: https://github.com/apache/drill/pull/2819#issuecomment-1684011222 @mbeckerle Could you please take another look. I had to fix a few things for a unit test. Thx! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org