[DISCUSS] Refactoring Drill's CSV (Text) Reader

Charles Givre Wed, 17 Nov 2021 16:31:12 -0800

Hello Drill Community, 
I would like to put forward some thoughts I've had relating to the CSV reader 
in Drill.  I would like to propose a few changes which could actually be 
breaking changes, so I wanted to see if there are any strongly held opinions in 
the community.  Here goes:


The Problems:
1.  By default, Drill leaves the extractHeader option set to false.  When a 
user queries a CSV file this way, the results are returned in a single array 
named columns.  Thus if a user wants the first column, they would project 
columns[0].  I have never been a fan of this behavior.  Drill does ship with 
the csvh file extension, which enables header extraction, but csvh is not a 
commonly used file extension.  Furthermore, the returned results (the columns 
array) do not work well with BI tools. 

2.  The CSV reader does not attempt to do any kind of data type discovery.
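
To make the two problems concrete, here is a minimal Python sketch that mimics the two reading modes. It is only an illustration and not Drill code: the function names read_as_columns_array and read_with_headers and the sample data are made up for this example.

```python
import csv
import io

# Made-up sample data, for illustration only.
SAMPLE = "id,name,price\n1,apple,0.50\n2,pear,0.75\n"


def read_as_columns_array(text):
    """Mimic header extraction disabled: each row is one `columns` list of strings."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_headers(text):
    """Mimic header extraction enabled: the first line supplies the column names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


rows = read_as_columns_array(SAMPLE)
named = read_with_headers(SAMPLE)

# Without header extraction the header line is just another data row,
# and the first field has to be addressed by position.
print(rows[0]["columns"][0])

# With header extraction fields are addressed by name, but every value
# is still a string because no data type discovery takes place.
print(named[0]["name"], type(named[0]["price"]).__name__)
```

In the first mode a client sees one array-valued column, which is the shape that tends to cause trouble for BI tools; in the second mode it sees one column per field, though all values are still text.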

Proposed Changes:
The overall goal is to make it easier to query CSV data and also to make the 
behavior more consistent across format plugins.
1.  Change the default behavior and set extractHeader to true. 
2.  Other formats, like the Excel reader, read tables directly into columns.  
If the header is not known, Drill assigns the name field_n.  I would propose 
replacing the `columns` array with a model similar to the Excel reader. 
3.  Implement schema discovery (data types) with an allTextMode option similar 
to the JSON reader.  When allTextMode is disabled, the CSV reader would 
attempt to infer data types. 
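
As a rough illustration of how the three proposed changes could fit together, here is a small Python sketch. It is not Drill code and does not claim to match how Drill or the Excel reader are implemented. The names read_csv, infer_column_cast, extract_headers and all_text_mode are hypothetical, the zero-based field_n numbering is an assumption, and the inference rule (int, then float, else text, decided per column) is just one simple choice.

```python
import csv
import io


def infer_column_cast(values):
    """Pick the narrowest of int, float, str that parses every non-empty value."""
    for cast in (int, float):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return cast
        except ValueError:
            continue
    return str


def read_csv(text, extract_headers=True, all_text_mode=False):
    """Read CSV text into a list of dicts, one named column per field."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, data = (rows[0], rows[1:]) if extract_headers else ([], rows)
    width = max(len(r) for r in rows)
    # Assumption: unknown or blank header names fall back to field_n (zero-based).
    names = [
        header[i] if i < len(header) and header[i].strip() else f"field_{i}"
        for i in range(width)
    ]
    # Pad short rows so every row has the same number of fields.
    data = [r + [""] * (width - len(r)) for r in data]
    if all_text_mode:
        casts = [str] * width
    else:
        casts = [infer_column_cast([r[i] for r in data]) for i in range(width)]
    # Empty values in a typed column become None; text columns keep "".
    return [
        {n: (None if v == "" and c is not str else c(v))
         for n, c, v in zip(names, casts, r)}
        for r in data
    ]


# Made-up sample data, for illustration only.
SAMPLE = "id,name,price\n1,apple,0.50\n2,pear,0.75\n"

typed = read_csv(SAMPLE)                            # headers + inferred types
as_text = read_csv(SAMPLE, all_text_mode=True)      # headers, everything as text
unnamed = read_csv(SAMPLE, extract_headers=False)   # field_n names
print(typed[0], as_text[0], unnamed[0], sep="\n")
```

Note that in the last call the header line is treated as data, so every column falls back to text. That is one reason the header and type inference settings probably need to be considered together.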

Since there are some breaking changes here, I'd like to ask if people have any 
strong feelings on this topic or suggestions. 
Thanks!
-- C
