Re: is there a way to provide inline array metadata to inform the xml_reader?
Hey Mike,

So it looks like I was wrong and the XML reader does not have support for arrays. However, once DRILL-8450 is merged, I'll add the readers for arrays. The XML reader itself still won't be able to detect them dynamically until we finish the XSD support, but at least the infra will be there.

Best,
-- C

> On Aug 15, 2023, at 11:39 PM, Charles Givre wrote:
>
> I stand corrected... It does not look like the XML reader has any support
> for arrays.
> -- C
Re: is there a way to provide inline array metadata to inform the xml_reader?
I stand corrected... It does not look like the XML reader has any support for arrays.

-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers wrote:
>
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>", such
> as "ARRAY<INT>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
>
> - Paul
Re: is there a way to provide inline array metadata to inform the xml_reader?
Hey Paul,

The XML reader was implemented using the EVF2 framework and, in theory, does have writers for repeated data types. I'm not sure to what extent this has been tested.

Best,
-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers wrote:
>
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>", such
> as "ARRAY<INT>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
>
> - Paul
Re: is there a way to provide inline array metadata to inform the xml_reader?
IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>", such as "ARRAY<INT>". This works, however, only if the XML reader uses the (very complex) EVF framework and has a way to control parsing based on the data type (and to set the data type based on parsing). The JSON reader has such an integration. Charles, did you do the work to add that kind of dynamic state machine to the XML parser?

- Paul

On Mon, Aug 14, 2023 at 6:28 PM Charles Givre wrote:

> Hi Mike,
> It is theoretically possible but I don't have an example of the syntax.
> As you've probably figured out, Drill vectors have both a type and data
> mode. The mode is either NULLABLE or REPEATED if I remember correctly.
> Thus, you could tell Drill via the inline schema that the data mode for a
> given field is REPEATED and that would be the Drill equivalent of an
> Array. I've never actually done this, so I don't really know if it would
> work for inline schemata but I'd assume that it would.
>
> I'll do some digging to see whether I have any examples of this.
> Best,
> --C
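For reference, here is a minimal sketch of what an inline schema with an array column might look like, following the ARRAY<INT> notation from Drill's provided-schema syntax. The file name `xml/example.xml` and the column `int_values` are hypothetical, and the helper only assembles the SQL text. Per the rest of this thread, the XML reader did not honor array types at the time, so treat this as an illustration of the notation, not as a query known to work against XML.

```java
public class InlineArraySchemaSketch {

    // Assembles a Drill table-function query that passes an inline provided schema.
    // Both arguments are supplied by the caller; nothing here talks to Drill.
    public static String buildQuery(String path, String inlineColumns) {
        return "SELECT * FROM table(cp.`" + path + "` "
            + "(type => 'xml', schema => 'inline=(" + inlineColumns + ")'))";
    }

    public static void main(String[] args) {
        // Hypothetical file and column names, chosen only for illustration.
        String columns = "`string_field` VARCHAR, `int_values` ARRAY<INT>";
        System.out.println(buildQuery("xml/example.xml", columns));
    }
}
```

The resulting string has the same shape as the query in the quoted test, so it could be passed to `client.queryBuilder().sql(...)` in the same way once a reader supports array columns.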
Re: is there a way to provide inline array metadata to inform the xml_reader?
Hi Mike,

It is theoretically possible, but I don't have an example of the syntax. As you've probably figured out, Drill vectors have both a type and a data mode. The mode is either NULLABLE or REPEATED, if I remember correctly. Thus, you could tell Drill via the inline schema that the data mode for a given field is REPEATED, and that would be the Drill equivalent of an array. I've never actually done this, so I don't really know whether it would work for inline schemata, but I'd assume that it would.

I'll do some digging to see whether I have any examples of this.

Best,
--C

> On Aug 14, 2023, at 3:36 PM, Mike Beckerle wrote:
>
> I'm trying to get my Drill SQL queries to produce the right thing from XML.
>
> A major thing that you can't easily infer from looking at just XML data is
> what is an array. XML lacks an array starting indicator.
>
> Is there an inline schema notation in the Drill Query language for
> array-ness, so that one can inform Drill what is an array?
>
> For example this provides simple types for all the fields directly in the
> query.
>
> @Test
> public void testSimpleProvidedSchema() throws Exception {
>   String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml` (type => 'xml', schema " +
>     "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field` FLOAT, `double_field` DOUBLE, `boolean_field` " +
>     "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field` TIMESTAMP, `string_field`" +
>     " VARCHAR, `date2_field` DATE properties {`drill.format` = `MM/dd/`})'))";
>   RowSet results = client.queryBuilder().sql(sql).rowSet();
>   assertEquals(2, results.rowCount());
>
> Can one also tell Drill what fields or child elements are arrays?
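The type-plus-mode idea can be sketched with a small stand-alone model. This is not Drill's API: the `Mode` enum and `Column` class below are hypothetical and use only the two mode names mentioned in the message (Drill's own type definitions name the modes OPTIONAL, REQUIRED and REPEATED). The point is only that an "array of INT" is an INT column whose mode is REPEATED.

```java
public class DataModeSketch {

    // Simplified, hypothetical stand-in for a vector's data mode.
    public enum Mode { NULLABLE, REPEATED }

    // A column is described by a value type plus a data mode.
    public static final class Column {
        public final String name;
        public final String type;
        public final Mode mode;

        public Column(String name, String type, Mode mode) {
            this.name = name;
            this.type = type;
            this.mode = mode;
        }

        // A REPEATED column holds zero or more values per row: the array case.
        public boolean isArray() {
            return mode == Mode.REPEATED;
        }

        public String describe() {
            return name + ": " + type + " (" + mode + ")";
        }
    }

    public static void main(String[] args) {
        // Hypothetical column names, for illustration only.
        Column scalar = new Column("int_field", "INT", Mode.NULLABLE);
        Column array = new Column("int_values", "INT", Mode.REPEATED);
        System.out.println(scalar.describe() + " array=" + scalar.isArray());
        System.out.println(array.describe() + " array=" + array.isArray());
    }
}
```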
is there a way to provide inline array metadata to inform the xml_reader?
I'm trying to get my Drill SQL queries to produce the right thing from XML.

A major thing that you can't easily infer from looking at just XML data is what is an array. XML lacks an array starting indicator.

Is there an inline schema notation in the Drill Query language for array-ness, so that one can inform Drill what is an array?

For example this provides simple types for all the fields directly in the query.

    @Test
    public void testSimpleProvidedSchema() throws Exception {
      String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml` (type => 'xml', schema " +
        "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field` FLOAT, `double_field` DOUBLE, `boolean_field` " +
        "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field` TIMESTAMP, `string_field`" +
        " VARCHAR, `date2_field` DATE properties {`drill.format` = `MM/dd/`})'))";
      RowSet results = client.queryBuilder().sql(sql).rowSet();
      assertEquals(2, results.rowCount());

Can one also tell Drill what fields or child elements are arrays?