Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Дмитрий Владимирович Thu, 18 Nov 2021 04:32:07 -0800

Please exclude me from this conversation.

On Thu, 18 Nov 2021 at 13:30, Charles Givre <[email protected]> wrote:


> Hi James,
> I do think it might be time to start considering creating a wiki of
> breaking changes for a Drill 2.0.  I'd also concur that having tons of
> config options that don't really add value is not a good use of config
> options as it leads to the creation of a lot of technical debt. I'll start
> a wiki page and put this on there.
>
> In the meantime, I may submit a PR that changes the default value of
> extractHeaders for CSV to true.  I don't really see that as a breaking
> change in that a user can simply change that flag and the previous behavior
> is restored.
> Best,
> -- C
>
>
>
> > On Nov 18, 2021, at 2:34 AM, James Turton <[email protected]> wrote:
> >
> > Definitely a +1 for this friendlier default behaviour and another +1 for
> the prospect of increased consistency across format plugins.
> >
> > My follow-up questions to the community.
> > Since these are examples of user-breaking changes, and not just in niche
> areas, are we approaching a point when we want to start working on Drill
> 2.x?
> > Do we have other user-breaking or significant refactoring ideas that
> we've been keeping stashed away in our heads, that would get their chance
> at life from the fact that a 2.x Drill can defensibly exhibit some
> incompatibilities with Drill 1.x?
> > Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where we
> record such ideas?
> > Would we be fine in terms of dev resources with supporting both bug fix
> releases to a 1.x series and also pushing forward in a 2.x series?
> > My own feeling is that to get the most value from a good proposal such
> as the below, we don't want to conceal everything behind default-false
> options in order to avoid breaking Drill 1.x users, we want to embrace the
> breakage which (to me) points to Drill 2.x.
> >
> > On 2021/11/18 02:30, Charles Givre wrote:
> >> Hello Drill Community,
> >> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill.  I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community.  Here goes:
> >>
> >> The Problems:
> >> 1.  The default behavior for Drill is to leave the extractColumnHeaders
> option as false.  When a user queries a CSV file this way, the results are
> returned in a list of columns called columns.  Thus if a user wants the
> first column, they would project columns[0].  I have never been a fan of
> this behavior.  Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) do not work well with
> BI tools.
> >>
> >> 2.  The CSV reader does not attempt to do any kind of data type
> discovery.
> >>
> >> Proposed Changes:
> >> The overall goal is to make it easier to query CSV data and also to
> make the behavior more consistent across format plugins.
> >> 1.  Change the default behavior and set the extractHeaders to true.
> >> 2.  Other formats, like the Excel reader, read tables directly into
> columns.  If the header is not known, Drill assigns a name of field_n.  I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> >> 3.  Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader.  When the allTextMode is disabled, the CSV
> reader would attempt to infer data types.
> >>
> >> Since there are some breaking changes here, I'd like to ask if people
> have any strong feelings on this topic or suggestions.
> >> Thanks!
> >> -- C
> >>
> >>
> >>
> >
>
>
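
For readers following the proposal quoted above, here is a minimal sketch of the three behaviours it describes: header extraction on by default, field_n names when there is no header row, and type inference that an allTextMode-style switch turns off. It is written in Python for illustration only and is not Drill's implementation. The names read_csv, infer_type, extract_headers and all_text_mode are hypothetical, the zero-based field_n numbering is an assumption, and the inference rule (int, then float, then text) is just one plausible choice.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of int, float, str that parses every non-empty value."""
    for caster in (int, float):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return caster
        except ValueError:
            continue
    return str


def read_csv(text, extract_headers=True, all_text_mode=False):
    """Hypothetical reader; assumes every row has the same number of fields."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if extract_headers:
        names, data = rows[0], rows[1:]
    else:
        # No header row: name columns field_0, field_1, ... (numbering is assumed).
        names = [f"field_{i}" for i in range(len(rows[0]))]
        data = rows
    if all_text_mode:
        # Skip inference entirely and keep every value as text.
        casters = [str] * len(names)
    else:
        casters = [infer_type([r[i] for r in data]) for i in range(len(names))]

    def convert(value, caster):
        # Empty cells become None in typed columns; text columns keep them as "".
        if value == "" and caster is not str:
            return None
        return caster(value)

    return [
        {n: convert(v, c) for n, c, v in zip(names, casters, r)}
        for r in data
    ]
```

Called as read_csv(text), each row comes back keyed by header name with typed values. Passing all_text_mode=True keeps every value as a string, and extract_headers=False falls back to the field_n names instead of a single columns array.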
