Definitely a +1 for this friendlier default behaviour and another +1 for
the prospect of increased consistency across format plugins.
My follow-up questions to the community:
1. Since these are examples of user-breaking changes, and not just in
niche areas, are we approaching a point when we want to start
working on Drill 2.x?
2. Do we have other user-breaking or significant refactoring ideas that
we've been keeping stashed away in our heads, that would get their
chance at life from the fact that a 2.x Drill can defensibly exhibit
some incompatibilities with Drill 1.x?
3. Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where
we record such ideas?
4. Would we be fine in terms of dev resources with supporting both bug
fix releases to a 1.x series and also pushing forward in a 2.x series?
My own feeling is that to get the most value from a good proposal such
as the one below, we don't want to conceal everything behind
default-false options in order to avoid breaking Drill 1.x users. We
want to embrace the breakage, which (to me) points to Drill 2.x.
On 2021/11/18 02:30, Charles Givre wrote:
Hello Drill Community,
I would like to put forward some thoughts I've had relating to the CSV reader
in Drill. I would like to propose a few changes which could actually be
breaking changes, so I wanted to see if there are any strongly held opinions in
the community. Here goes:
The Problems:
1. The default behavior for Drill is to leave the extractHeader option
as false. When a user queries a CSV file this way, the results are returned in
a single array named `columns`. Thus if a user wants the first column, they
would project columns[0]. I have never been a fan of this behavior. Even
though Drill ships with the csvh file extension, which enables header
extraction, this is not a commonly used file extension. Furthermore, the
returned results (the `columns` array) do not work well with BI tools.
2. The CSV reader does not attempt to do any kind of data type discovery.
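
To make the first problem concrete, here is a minimal Python sketch of the two result shapes. It is not Drill code: the function names and the sample data are made up for illustration, and it assumes that with header extraction off the header line comes back as an ordinary data row.

```python
import csv
import io

# Hypothetical sample file contents, for illustration only.
SAMPLE = "id,name\n1,alice\n2,bob\n"


def read_as_columns_array(text):
    """Mimic header extraction off: every line becomes one `columns` list."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_headers(text):
    """Mimic header extraction on: the first line supplies the field names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


rows_array = read_as_columns_array(SAMPLE)
rows_named = read_with_headers(SAMPLE)
```

In the first shape the caller has to address values by position (`row["columns"][0]`), while in the second each value is reachable by name (`row["id"]`), which is the shape BI tools generally expect.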
Proposed Changes:
The overall goal is to make it easier to query CSV data and also to make the
behavior more consistent across format plugins.
1. Change the default behavior and set extractHeader to true.
2. Other formats, like the Excel reader, read tables directly into columns.
If the header is not known, Drill assigns a name of field_n. I would propose
replacing the `columns` array with a model similar to the Excel reader.
3. Implement schema discovery (data types) with an allTextMode option similar
to the JSON reader. When the allTextMode is disabled, the CSV reader would
attempt to infer data types.
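
Proposed changes 2 and 3 could behave roughly like the Python sketch below. This is an illustration under stated assumptions, not the Drill implementation: the helper names are hypothetical, the field_n numbering is assumed to start at 0, the only inferred types are integer and float, and all rows are assumed to have the same width.

```python
def field_names(header, width):
    """Use header names where present, else fall back to field_n.

    Assumption: n starts at 0; the proposal does not specify the base.
    """
    names = []
    for i in range(width):
        name = header[i].strip() if header and i < len(header) else ""
        names.append(name or f"field_{i}")
    return names


def infer_column_type(values):
    """Return the narrowest of int, float, str that parses every value."""
    for cast in (int, float):
        try:
            for value in values:
                cast(value)
            return cast
        except ValueError:
            continue
    return str


def convert_rows(rows, all_text_mode):
    """Leave values as text in all-text mode, else cast each column."""
    if all_text_mode or not rows:
        return [list(row) for row in rows]
    # Assumption: rows are rectangular, so zip(*rows) yields whole columns.
    casts = [infer_column_type(column) for column in zip(*rows)]
    return [[cast(v) for cast, v in zip(casts, row)] for row in rows]
```

Inferring one type per column rather than per value keeps a column from mixing integers and strings, at the cost of scanning the data before any row can be converted. A streaming reader would need a different strategy, such as sampling the first rows.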
Since there are some breaking changes here, I'd like to ask if people have any
strong feelings on this topic or suggestions.
Thanks!
-- C