Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
XML will never die. The Cobol programmers were reincarnated and built
similarly long-lasting generators of XML.

If you have a schema, then it is a reasonable format for Drill to parse, if
only to turn around and write to another format.



On Wed, Apr 6, 2022 at 7:31 PM Paul Rogers  wrote:



Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Lee, David
I wrote an XML-to-Parquet converter as well.

It basically extends XML to JSON: it writes the JSON data into a memory-mapped
file, reads the memory-mapped file into an Apache Arrow columnar table, and
saves the Arrow table as a Parquet file.

https://github.com/davlee1972/xml_to_json
https://arrow.apache.org/docs/python/json.html
https://arrow.apache.org/docs/python/parquet.html

I deal with some data feeds which are 2 TB worth of XML a month and these XML 
files have 5,000 different elements with nesting levels that go 10 levels deep.

On Apr 6, 2022, at 7:30 PM, Paul Rogers  wrote:


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Paul Rogers
Hi Luoc,

First, what poor soul is asked to deal with large amounts of XML in this
day and age? I thought we were past the XML madness, except in Maven and
Hadoop config files.

XML is much like JSON, only worse. JSON at least has well-defined types
that can be gleaned from JSON syntax. With XML...? Anything goes because
XML is a document mark-up language, not a data structure description
language.

The classic problem with XML is that if XML is used to describe a
reasonable data structure (rows and columns), then it can reasonably be
parsed into rows and columns. If XML represents a document (or a
relationship graph), then there is no good mapping to rows and columns.
This was true 20 years ago and it is true today.

So, suppose your XML represents row-like data. Then an XML parser could
hope for the best and make a good guess at the types and structure. The XML
parser could work like the new & improved JSON parser (based on EVF2) which
Vitalii is working on. (I did the original work and Vitalii has the
thankless task of updating that work to match the current code.) That JSON
parser is VERY complex as it infers types on the fly. Quick, what type is
"a" in [{"a": null}, {"a": null}, {"a": []}]. We don't know. Only when
{"a": [10]} appears can we say, "Oh! All those "a" were REPEATED INTs!"
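The deferred decision above can be sketched as a tiny inference loop (a simplified illustration, not the actual Drill code):

```python
def infer_type(values):
    """Infer a column type from JSON values, deferring until evidence appears."""
    decided = None
    for v in values:
        if v is None or v == []:
            continue  # null or empty array: no evidence yet
        if isinstance(v, list):
            decided = f"REPEATED {type(v[0]).__name__.upper()}"
        else:
            decided = type(v).__name__.upper()
        break
    return decided or "UNKNOWN"

print(infer_type([None, None, []]))        # UNKNOWN - no evidence yet
print(infer_type([None, None, [], [10]]))  # REPEATED INT
```

The real reader must also handle evidence arriving after vectors were already allocated, which is where most of the complexity lives.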

An XML parser could use the same tricks. In fact, it can probably use the
same code. In JSON, the parser sends events, and the Drill code does its
type inference magic based on those events. An XML parser can emit similar
events, and make similar decisions.
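For instance, Python's standard library already exposes such parser events; the sketch below shows the start/end event stream an inference layer could consume (illustrative only, not Drill's reader):

```python
import xml.etree.ElementTree as ET
from io import StringIO

xml = "<rows><row><a>10</a></row><row><a>20</a></row></rows>"

# iterparse emits (event, element) pairs as the document streams by,
# much like a streaming JSON tokenizer emits tokens.
events = []
for event, elem in ET.iterparse(StringIO(xml), events=("start", "end")):
    events.append((event, elem.tag, elem.text if event == "end" else None))

# A type-inference layer would consume these events just as it
# consumes JSON tokens, deciding column types as values arrive.
print(events)
```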

As you noted, if we have a DTD, we don't have to do schema inference. But,
we do have to do DTD-to-rows-and-columns inference. Once we do that, we use
the provided schema as you suggested. (The JSON reader I mentioned already
supports a provided schema to add sanity to the otherwise crazy JSON type
inference process when data is sparse and changing.)

In fact, if you convert XML to JSON, then the XML-to-JSON parser has to
make those same decisions. Hopefully someone has already done that and
users would be willing to use that fancy tool to convert their XML to JSON
before using Drill. (Of course, if they want good performance, they should
have converted XML to Parquet instead.)

So, rather than have a super-fancy Drill XML reader, maybe find a
super-fancy XML-to-Parquet converter, use that once, and then let Drill
quickly query Parquet. The results will be much better than trying to parse
XML over and over on each query. Just because we *can* do it doesn't mean
we *should*.

Thanks,

- Paul



On Wed, Apr 6, 2022 at 5:01 AM luoc  wrote:

> 
> Hello dear driller,
>
> Before starting the topic, I would like to do a simple survey:
>
> 1. Did you know that Drill already supports XML format?
>
> 2. If yes, what is the maximum size for the XML files you normally read? 1MB,
> 10MB, or 100MB?
>
> 3. Do you expect that reading XML will be as easy as JSON (Schema
> Discovery)?
>
> Thank you for responding to those questions.
>
> XML is different from JSON, and if we rely solely on the Drill driver to
> deduce the structure of the data (that is, the *schema*), the code will
> get very complex and fragile.
>
> For example, inferring array structure and numeric ranges is tricky. So, a
> "provided schema" or "TO_JSON" may be good medicine:
>
> *Provided Schema*
>
> We can add DTD or XML Schema (XSD) support for XML. It can build all
> value vectors (Writer) before reading data, resolving the fields, types,
> and complex nesting.
>
> However, a definition file is actually a rule validator that allows
> elements to appear 0 or more times. As a result, it is not possible to know
> if all elements exist until the data is read.
>
> Therefore, we should avoid creating, before reading the data, a large
> number of value vectors for elements that may not actually exist.
>
> We can build the top schema at the initial stage and add new value vectors
> as needed during the reading phase.
>
> *TO_JSON*
>
> Read and convert XML directly to JSON, using the JSON Reader for data
> resolution.
>
> It makes it as easy for us to query XML data as JSON, but requires
> reading the whole XML file into memory.
>
> I think both approaches are feasible, so I look forward to your spirited discussion.
>
> Thanks.
>
> - luoc
>


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Charles Givre
I'll add to this thread as the developer of the XML plugin for Drill. IMHO it
would be a very good idea to add XSD schema support. I've not had time to really
dig into it, but writing a converter from XSD to Drill's TupleMetadata would
seem relatively straightforward. Then we'd have to make sure the schema
provisioning works, which actually isn't that hard. I started looking into the
first part, but got sidetracked. In any event, having the schema from an XSD
would eliminate the ambiguity in XML files.
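A first cut at that XSD-to-columns mapping might look like the sketch below (standard library only; the type map and the flat column triples are hypothetical stand-ins for Drill's TupleMetadata API):

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical XSD-to-Drill type mapping for a few simple types.
TYPE_MAP = {"xs:string": "VARCHAR", "xs:int": "INT", "xs:decimal": "VARDECIMAL"}

def xsd_columns(xsd_text):
    """Extract (name, type, repeated) triples from xs:element declarations."""
    root = ET.fromstring(xsd_text)
    cols = []
    for el in root.iter(XS + "element"):
        name = el.get("name")
        dtype = TYPE_MAP.get(el.get("type", ""), "VARCHAR")
        repeated = el.get("maxOccurs") == "unbounded"
        cols.append((name, dtype, repeated))
    return cols

# Toy schema fragment (simplified; real XSDs nest elements in complex types).
xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="pet" type="xs:string" maxOccurs="unbounded"/>
  <xs:element name="age" type="xs:int"/>
</xs:schema>"""

print(xsd_columns(xsd))  # [('pet', 'VARCHAR', True), ('age', 'INT', False)]
```

A real converter would also have to walk complex types, handle namespaces and imports, and map repeated/nested declarations onto Drill's REPEATED and MAP types.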

As a side benefit, this would allow for easy conversion from XML to JSON.  You 
could simply do a CTAS query on an XML file and output JSON.

Best,
-- C


> On Apr 6, 2022, at 5:41 PM, Ted Dunning  wrote:

Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
And if there are zero instances what happens (curiosity here)?



On Wed, Apr 6, 2022 at 12:28 PM Lee, David 
wrote:



RE: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Lee, David
Which is why using an XSD is more or less foolproof..

If the pet element is tagged with maxOccurs="unbounded" it implies it should be
saved as an array even if there is just one occurrence of <pet> in your data.
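A toy sketch of that schema-driven rule (the REPEATED set stands in for what an XSD's maxOccurs="unbounded" declarations would supply):

```python
import xml.etree.ElementTree as ET

# From the XSD: pet is declared maxOccurs="unbounded", so it is always a list.
REPEATED = {"pet"}

def to_record(xml_text):
    """Convert child elements to a dict, forcing schema-declared repeats to lists."""
    rec = {}
    for child in ET.fromstring(xml_text):
        if child.tag in REPEATED:
            rec.setdefault(child.tag, []).append(child.text)
        else:
            rec[child.tag] = child.text
    return rec

# One occurrence still becomes an array, so the column type never flip-flops.
print(to_record("<row><pet>dog</pet></row>"))                # {'pet': ['dog']}
print(to_record("<row><pet>dog</pet><pet>cat</pet></row>"))  # {'pet': ['dog', 'cat']}
```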



Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
That example:

<pet>dog</pet>
<pet>cat</pet>

can also convert to {"pet": "dog", "pet": "cat"}

XML is rife with problems like this.

As you say.

But worse than can be imagined unless you have been hit by these problems.

On Wed, Apr 6, 2022 at 11:39 AM Lee, David 
wrote:



RE: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Lee, David
TO_JSON won't work in cases where..

One file contains: <pet>dog</pet> which converts to {"pet":"dog"}

But another file contains:
<pet>dog</pet>
<pet>cat</pet>
which converts to: {"pet": ["dog", "cat"]}

pet as a column in Drill can't be both a varchar and an array of varchar.

There are a ton of gotchas when dealing with XML..
numeric vs string
scalar vs array
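A naive converter reproduces that ambiguity directly (standard-library sketch, not Drill's reader):

```python
import xml.etree.ElementTree as ET

def naive_to_json(xml_text):
    """Naively convert children: one occurrence -> scalar, several -> list."""
    rec = {}
    for child in ET.fromstring(xml_text):
        if child.tag in rec:
            # Second occurrence: silently promote the scalar to a list.
            if not isinstance(rec[child.tag], list):
                rec[child.tag] = [rec[child.tag]]
            rec[child.tag].append(child.text)
        else:
            rec[child.tag] = child.text
    return rec

# The same element name yields two different column types across files:
print(naive_to_json("<row><pet>dog</pet></row>"))              # scalar
print(naive_to_json("<row><pet>dog</pet><pet>cat</pet></row>"))  # list
```

Without a schema, nothing tells the converter that the first file's lone pet should also have been a list.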

-Original Message-
From: Lee, David 
Sent: Wednesday, April 6, 2022 10:54 AM
To: u...@drill.apache.org; dev@drill.apache.org
Subject: RE: [DISCUSS] Add schema support for the XML format

I wrote something to convert XML to JSON using an XSD schema file to solve
fields, types, nested structures, etc. It's the only real way to ensure
column-level data integrity.

https://github.com/davlee1972/xml_to_json

Converts XML to valid JSON or JSONL. Requires only two files to get started: 
your XML file and the XSD schema file for that XML file.
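The schema-driven idea can be sketched as follows. Here the XSD is stood in for by a hand-written table (`SCHEMA` and `convert` are illustrative names only; a real converter such as xml_to_json parses the .xsd file itself to build this information):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, hand-extracted slice of an XSD:
# element name -> (Python type, declared as repeatable?).
SCHEMA = {"pet": (str, True), "age": (int, False)}

def convert(elem):
    out = {}
    for name, (cast, repeated) in SCHEMA.items():
        values = [cast(child.text) for child in elem.findall(name)]
        if repeated:
            out[name] = values          # always a list, even with one value
        elif values:
            out[name] = values[0]       # always a scalar
    return out

row = ET.fromstring("<row><pet>dog</pet><age>3</age></row>")
print(json.dumps(convert(row)))  # {"pet": ["dog"], "age": 3}
```

Because the schema, not the instance document, decides scalar-vs-array and numeric-vs-string, every file converts to the same column types.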

-Original Message-
From: luoc  
Sent: Wednesday, April 6, 2022 5:01 AM
To: u...@drill.apache.org; dev@drill.apache.org
Subject: [DISCUSS] Add schema support for the XML format

External Email: Use caution with links and attachments


Hello dear driller,

Before starting the topic, I would like to do a simple survey :

1. Did you know that Drill already supports XML format?

2. If yes, what is the maximum size of the XML files you normally read? 1 MB, 
10 MB, or 100 MB?

3. Do you expect that reading XML will be as easy as JSON (Schema Discovery)?

Thank you for responding to those questions.

XML is different from JSON, and if we rely solely on Drill to infer the 
structure of the data (the schema), the code will become very complex and 
fragile.

For example, it must infer array structures and numeric ranges. So a "provided 
schema" or "TO_JSON" may be good medicine:

Provided Schema

We can add DTD or XML Schema (XSD) support for XML. This would build all 
value vectors (writers) before reading data, resolving fields, types, and 
complex nesting.

However, a definition file is really a validator that allows elements to 
appear zero or more times. As a result, it is not possible to know whether all 
elements exist until the data is read.

Therefore, we should avoid creating a large number of value vectors for 
elements that may never actually appear in the data.

We can build the top schema at the initial stage and add new value vectors as 
needed during the reading phase.
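This lazy strategy can be sketched in miniature: start with an empty top-level schema and add a "value vector" only when its element first appears, backfilling nulls for earlier rows (`RowBatchWriter` is an illustrative stand-in for Drill's EVF writers, not actual Drill code):

```python
# Sketch of lazy column creation: columns are added only when an element
# actually appears in the data, rather than pre-building one per XSD entry.
class RowBatchWriter:
    def __init__(self):
        self.columns = {}    # column name -> list of values ("value vector")
        self.row_count = 0

    def write(self, row):
        for name, value in row.items():
            if name not in self.columns:
                # New element seen mid-read: backfill nulls for earlier rows.
                self.columns[name] = [None] * self.row_count
            self.columns[name].append(value)
        # Pad columns absent from this row so all vectors stay aligned.
        for vec in self.columns.values():
            if len(vec) <= self.row_count:
                vec.append(None)
        self.row_count += 1

w = RowBatchWriter()
w.write({"a": 1})
w.write({"a": 2, "b": "x"})
print(w.columns)  # {'a': [1, 2], 'b': [None, 'x']}
```

Only the elements that actually occur get vectors, while the provided schema remains available to assign each new vector its declared type.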

TO_JSON

Read and convert XML directly to JSON, using the JSON reader for data 
resolution.

This makes querying XML data as easy as JSON, but requires reading the whole 
XML file into memory.
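A minimal sketch of such a TO_JSON step (stdlib only; `to_json` is an illustrative name, not a proposed Drill function signature) makes the memory trade-off concrete, since the whole document tree is parsed before any row is produced:

```python
import io
import json
import xml.etree.ElementTree as ET

def to_json(xml_source):
    # Parses the ENTIRE document into memory first -- the drawback noted above.
    root = ET.parse(xml_source).getroot()

    def walk(elem):
        children = list(elem)
        if not children:
            return elem.text
        out = {}
        for c in children:
            out.setdefault(c.tag, []).append(walk(c))
        # Unwrap single-element lists so lone tags stay scalars.
        return {k: v[0] if len(v) == 1 else v for k, v in out.items()}

    return json.dumps({root.tag: walk(root)})

doc = io.StringIO("<rows><row><pet>dog</pet></row></rows>")
print(to_json(doc))  # {"rows": {"row": {"pet": "dog"}}}
```

The output can then be handed to the existing JSON reader unchanged, which is the appeal of this approach; the cost is holding both the XML tree and the JSON text in memory at once.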

I think both approaches are feasible, so I look forward to a spirited discussion.

Thanks.

- luoc





