Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

James Turton Wed, 17 Nov 2021 23:34:16 -0800

Definitely a +1 for this friendlier default behaviour and another +1 for the prospect of increased consistency across format plugins.


My follow-up questions to the community.


1. Since these are examples of user-breaking changes, and not just in
   niche areas, are we approaching a point when we want to start
   working on Drill 2.x?
2. Do we have other user-breaking or significant refactoring ideas that
   we've been keeping stashed away in our heads, that would get their
   chance at life from the fact that a 2.x Drill can defensibly exhibit
   some incompatibilities with Drill 1.x?
3. Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where
   we record such ideas?
4. Would we be fine in terms of dev resources with supporting both bug
   fix releases to a 1.x series and also pushing forward in a 2.x series?

My own feeling is that to get the most value from a good proposal such as the one below, we don't want to conceal everything behind default-false options in order to avoid breaking Drill 1.x users; we want to embrace the breakage, which (to me) points to Drill 2.x.


On 2021/11/18 02:30, Charles Givre wrote:

Hello Drill Community,
I would like to put forward some thoughts I've had relating to the CSV reader 
in Drill.  I would like to propose a few changes which could actually be 
breaking changes, so I wanted to see if there are any strongly held opinions in 
the community.  Here goes:

The Problems:
1.  The default behavior for Drill is to leave the extractColumnHeaders option 
as false.  When a user queries a CSV file this way, the results are returned in 
a list of columns called columns.  Thus if a user wants the first column, they 
would project columns[0].  I have never been a fan of this behavior.  Even 
though Drill ships with the csvh file extension which enables the header 
extraction, this is not a commonly used file format.  Furthermore, the returned 
results (the column list) do not work well with BI tools.

2.  The CSV reader does not attempt to do any kind of data type discovery.
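To make the first problem concrete, here is a rough sketch of the two result shapes. It is plain Python using the standard csv module, not Drill code; the sample data and both function names are made up for illustration, and it only approximates what the reader returns with and without header extraction.

```python
import csv
import io

# Hypothetical sample file contents, used only for this illustration.
SAMPLE = "id,name\n1,alice\n2,bob\n"


def read_as_columns_array(text):
    """Approximate the no-header mode: every line (the header line included)
    becomes a row with a single list-valued field called `columns`."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_headers(text):
    """Approximate header extraction: the first line supplies the field
    names and each remaining line becomes a row of named fields."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


rows_array = read_as_columns_array(SAMPLE)
rows_named = read_with_headers(SAMPLE)

# Positional access, analogous to projecting columns[0].
first_by_position = rows_array[1]["columns"][0]
# Named access, which is what most BI tools expect.
first_by_name = rows_named[0]["id"]
```

Note that in the array shape the header line shows up as an ordinary data row, and every value is reachable only by position.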

Proposed Changes:
The overall goal is to make it easier to query CSV data and also to make the 
behavior more consistent across format plugins.
1.  Change the default behavior and set extractHeaders to true.
2.  Other formats, like the Excel reader, read tables directly into columns.  
If the header is not known, Drill assigns a name of field_n.  I would propose 
replacing the `columns` array with a model similar to the Excel reader.
3.  Implement schema discovery (data types) with an allTextMode option similar 
to the JSON reader.  When the allTextMode is disabled, the CSV reader would 
attempt to infer data types.
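Changes 2 and 3 could look roughly like the following. Again this is a Python sketch and not Drill code: read_table, infer_value and the has_header / all_text_mode parameters are hypothetical names, numbering the field_n names from 0 is an assumption, and types are inferred per value here, whereas a real reader would have to settle on one type per column.

```python
import csv
import io


def infer_value(text):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def read_table(text, has_header=True, all_text_mode=False):
    """Read CSV text into rows of named fields.

    Without a header, fields are named field_0, field_1, ... (assumed
    numbering). With all_text_mode enabled, every value stays a string.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        names = [f"field_{i}" for i in range(len(rows[0]))]
        data = rows
    convert = (lambda value: value) if all_text_mode else infer_value
    return [{n: convert(v) for n, v in zip(names, row)} for row in data]


typed = read_table("id,score,name\n1,2.5,alice\n")
unnamed = read_table("1,2.5\n", has_header=False)
as_text = read_table("id,score\n1,2.5\n", all_text_mode=True)
```

Keeping the all-text switch means existing queries that expect strings can opt out of inference, which is the same trade-off the JSON reader's option makes.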

Since there are some breaking changes here, I'd like to ask if people have any 
strong feelings on this topic or suggestions.
Thanks!
-- C
