Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Z0ltrix Wed, 17 Nov 2021 22:53:52 -0800

I would appreciate such a change.

Each time I introduce Drill to users I start with a CSV example, and it's hard
to explain why it has to be so difficult just to read a simple CSV file.


Data type discovery would be cool, but it is not the highest priority. Casting
by users is fine as long as they have an intuitive way to query the strings.
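
As a rough illustration of the data type discovery mentioned here, the
following is a minimal Python sketch, not Drill code. The function names and
the all_text_mode flag are hypothetical; the flag only mirrors the allTextMode
idea from the proposal. It assumes every row has the same number of fields and
only distinguishes integers, floats and strings.

```python
def infer_column_type(values):
    """Pick the narrowest of int, float, str that parses every value in a column."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def convert_rows(rows, all_text_mode=True):
    """Return rows unchanged as strings in all-text mode, otherwise with inferred types."""
    if all_text_mode or not rows:
        return [list(row) for row in rows]
    columns = list(zip(*rows))  # assumes all rows have the same length
    types = [infer_column_type(column) for column in columns]
    return [[convert(value) for convert, value in zip(types, row)] for row in rows]
```

With all_text_mode left on, users keep casting the strings themselves, which is
the behavior described above as acceptable for now.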

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

Ted Dunning <[email protected]> wrote on Thursday, 18 November 2021 at 07:17:

> I think that these would be significant improvements.
>
> The current behavior is pretty painful on average. Better defaults and just
> a bit of deduction could pay off big. I even think that the presence of
> headers might be pretty reliably inferred.
>
> On Wed, Nov 17, 2021 at 4:31 PM Charles Givre [email protected] wrote:
>
> > Hello Drill Community,
> >
> > I would like to put forward some thoughts I've had relating to the CSV
> > reader in Drill. I would like to propose a few changes which could
> > actually be breaking changes, so I wanted to see if there are any strongly
> > held opinions in the community. Here goes:
> >
> > The Problems:
> >
> > 1.  The default behavior for Drill is to leave the extractColumnHeaders
> >     option as false. When a user queries a CSV file this way, the results
> >     are returned in a list of columns called columns. Thus if a user wants
> >     the first column, they would project columns[0]. I have never been a
> >     fan of this behavior. Even though Drill ships with the csvh file
> >     extension which enables the header extraction, this is not a commonly
> >     used file format. Furthermore, the returned results (the column list)
> >     does not work well with BI tools.
> > 2.  The CSV reader does not attempt to do any kind of data type discovery.
> >
> > Proposed Changes:
> >
> > The overall goal is to make it easier to query CSV data and also to make
> > the behavior more consistent across format plugins.
> >
> > 1.  Change the default behavior and set the extractHeaders to true.
> > 2.  Other formats, like the excel reader, read tables directly into
> >     columns. If the header is not known, Drill assigns a name of field_n. I
> >     would propose replacing the `columns` array with a model similar to the
> >     Excel reader.
> > 3.  Implement schema discovery (data types) with an allTextMode option
> >     similar to the JSON reader. When the allTextMode is disabled, the CSV
> >     reader would attempt to infer data types.
> >
> > Since there are some breaking changes here, I'd like to ask if people have
> > any strong feelings on this topic or suggestions.
> >
> > Thanks!,
> >
> > -- C
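
To make the difference between the two row models in the quoted proposal
concrete, here is a minimal Python sketch. It is not Drill code: the function
names and the extract_headers parameter are hypothetical, and starting the
field_n fallback names at 0 is an assumption, since the thread does not say
where the numbering begins.

```python
import csv
import io


def read_columns_rows(text):
    """Current model: every row is one record holding a single list named 'columns'."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_named_rows(text, extract_headers=True):
    """Proposed model: one named field per column.

    With extract_headers the first row supplies the names; otherwise
    fallback names field_0, field_1, ... are generated (assumed numbering).
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if extract_headers:
        names, data = rows[0], rows[1:]
    else:
        names = [f"field_{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]


sample = "id,name\n1,ann\n"
first_column_old = read_columns_rows(sample)[1]["columns"][0]  # like columns[0]
first_column_new = read_named_rows(sample)[0]["id"]            # addressed by name
```

In the first model a client has to index into the list, as with columns[0];
in the second each column can be addressed by name, which is the shape BI
tools generally expect.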

