Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Charles Givre Thu, 18 Nov 2021 04:30:11 -0800

Hi James,
I do think it might be time to start a wiki page of breaking changes for 
Drill 2.0.  I'd also concur that piling up config options that don't really 
add value is a poor trade, as it creates a lot of technical debt. I'll start a 
wiki page and put this on there.


In the meantime, I may submit a PR that changes the default value of 
extractHeader for CSV to true.  I don't really see that as a breaking change, 
since a user can simply flip that flag to restore the previous behavior.

Best,
-- C



> On Nov 18, 2021, at 2:34 AM, James Turton <[email protected]> wrote:
> 
> Definitely a +1 for this friendlier default behaviour and another +1 for the 
> prospect of increased consistency across format plugins.
> 
> My follow-up questions to the community.
> Since these are examples of user-breaking changes, and not just in niche 
> areas, are we approaching a point when we want to start working on Drill 2.x?
> Do we have other user-breaking or significant refactoring ideas that we've 
> been keeping stashed away in our heads, that would get their chance at life 
> from the fact that a 2.x Drill can defensibly exhibit some incompatibilities 
> with Drill 1.x?
> Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where we record 
> such ideas?
> Would we be fine in terms of dev resources with supporting both bug fix 
> releases to a 1.x series and also pushing forward in a 2.x series?
> My own feeling is that to get the most value from a good proposal such as the 
> below, we don't want to conceal everything behind default-false options in 
> order to avoid breaking Drill 1.x users; we want to embrace the breakage, 
> which (to me) points to Drill 2.x.
> 
> On 2021/11/18 02:30, Charles Givre wrote:
>> Hello Drill Community, 
>> I would like to put forward some thoughts I've had relating to the CSV 
>> reader in Drill.  I would like to propose a few changes which could actually 
>> be breaking changes, so I wanted to see if there are any strongly held 
>> opinions in the community.  Here goes:
>> 
>> The Problems:
>> 1.  By default, Drill leaves the extractHeader option set to false.  When a 
>> user queries a CSV file this way, the results are returned in a single list 
>> called columns.  Thus if a user wants the first column, they would project 
>> columns[0].  I have never been a fan of this behavior.  Even though Drill 
>> ships with the csvh file extension, which enables header extraction, this 
>> is not a commonly used file format.  Furthermore, the returned results (the 
>> columns list) do not work well with BI tools.
>> 
>> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>> 
>> Proposed Changes:
>> The overall goal is to make it easier to query CSV data and also to make the 
>> behavior more consistent across format plugins.
>> 1.  Change the default behavior and set extractHeader to true.
>> 2.  Other formats, like the Excel reader, read tables directly into columns.
>> If the header is not known, Drill assigns a name of field_n.  I would 
>> propose replacing the `columns` array with a model similar to the Excel 
>> reader.
>> 3.  Implement schema discovery (data types) with an allTextMode option 
>> similar to the JSON reader.  When the allTextMode is disabled, the CSV 
>> reader would attempt to infer data types. 
>> 
>> Since there are some breaking changes here, I'd like to ask if people have 
>> any strong feelings on this topic or suggestions. 
>> Thanks!
>> -- C
>> 
>> 
>> 
> 
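
For anyone following the thread, the two reading modes discussed in problem 1 
and in proposed changes 1 and 2 can be sketched in a few lines of Python.  This 
is only an illustration, not Drill code: the function names and the sample data 
are made up, and numbering field_n from 0 is an assumption, since the thread 
only says that unknown headers become field_n.

```python
import csv
import io


def read_as_columns_array(text):
    """Header-less mode: each row comes back as one list under the key 'columns'."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_headers(text):
    """Header mode: the first row supplies column names.

    A blank header cell is replaced by field_n (0-based index is an assumption).
    """
    rows = csv.reader(io.StringIO(text))
    header = next(rows, [])
    names = [name.strip() or f"field_{i}" for i, name in enumerate(header)]
    return [dict(zip(names, row)) for row in rows]


# Hypothetical sample: the third column has no header.
SAMPLE = "id,name,\n1,alice,x\n2,bob,y\n"

if __name__ == "__main__":
    print(read_as_columns_array(SAMPLE)[1])  # header row is just data here
    print(read_with_headers(SAMPLE)[0])
```

With header extraction a query can refer to name directly, whereas the 
columns-array form needs columns[1] and also returns the header row as data.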

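Proposed change 3 in the quoted message (type inference that an allTextMode 
option can switch off) might look roughly like the Python sketch below.  It is 
not how Drill or its JSON reader works: the helper names are hypothetical, the 
candidate types (integer, float, boolean) are an assumption, and it infers per 
value rather than per column to keep the example short.

```python
def infer_value(raw):
    """Try progressively looser types; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return raw


def convert_row(row, all_text_mode=True):
    """Return the row unchanged (all strings) or with inferred value types."""
    if all_text_mode:
        return dict(row)
    return {key: infer_value(value) for key, value in row.items()}


if __name__ == "__main__":
    row = {"id": "1", "price": "2.5", "name": "bob", "active": "TRUE"}
    print(convert_row(row, all_text_mode=True))
    print(convert_row(row, all_text_mode=False))
```

A real reader would need to settle on one type per column, which is where most 
of the difficulty lies (for example, a column that looks numeric for the first 
thousand rows and then contains text).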