Cool, I'll do the review.

Kind regards,
Arina
On Fri, Oct 12, 2018 at 9:31 AM Paul Rogers <[email protected]> wrote:

> Hi Arina,
>
> Just did another PR towards completing the scan operator revision to work with the "result set loader." This one is mostly plumbing to implement projection with the scan operator. It generalizes lots of code that already exists into a single, unified mechanism.
>
> Basically, this one takes care of mapping from the data source's schema to the projection list provided by the query (empty, wildcard, or a list of columns). It provides mechanisms for the all-null columns (our famous nullable INT), for the implicit columns, and so on. This particular solution ensures that the data source only worries about populating its own vectors; it does not worry about Drill-specific columns (nulls or implicit), nor does it worry about projection if it is just reading a set of records with a fixed schema.
>
> This PR includes the foundation for file-level schema support. The idea is that the scan operator will ask each reader if it has an up-front schema. Something like Parquet or JDBC can get the schema from the data (or the data source) itself. Something like JSON could get the schema from a schema file, or from information passed along with the reader's physical plan (like what JC did for MsgPack).
>
> The mechanism still allows schemas to be "discovered" on the fly, and has quite a bit of code to handle the many bizarre cases that can occur (and that we've been discussing). This is called "schema smoothing": handling the case where column x appears in, say, file 1, but not in file 2, and then shows up again in file 3.
>
> The next PR will assemble this stuff into a scan framework, after which I can add the three readers: mock, delimited and JSON.
>
> My goal is that, with the scan framework and the CSV and JSON examples, the team can retrofit other readers as the need arises.
>
> The entire mechanism, and the design goals behind it, are documented in [1].
>
> Thanks,
> - Paul
>
> [1] https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades
>
> On Thursday, October 11, 2018, 2:51:22 AM PDT, Arina Yelchiyeva <[email protected]> wrote:
>
> Paul,
>
> sounds good. I like the idea of the mock scanner being done first, since besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(-s), if any, so I can follow them?
>
> Kind regards,
> Arina
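
As a rough illustration of the projection resolution Paul describes above, here is a minimal Java sketch: the reader fills only its own vectors, while the scan framework supplies implicit columns and the null-filled ("nullable INT") columns, which is also what lets schema smoothing keep the output schema stable across files. All names here (Reader, ColumnSource, resolve) are hypothetical and do not correspond to the actual classes in the PR.

    import java.util.List;
    import java.util.Optional;

    public class ProjectionSketch {

      // A reader may (or may not) know its schema before reading any rows.
      // Parquet or JDBC would typically report one; JSON usually would not.
      interface Reader {
        Optional<List<String>> upFrontSchema();
        List<String> discoveredSchema(); // columns actually seen while reading
      }

      enum ColumnSource { FROM_READER, NULL_FILLED, IMPLICIT }

      // Resolve one projected column against what the reader can supply.
      // A wildcard projection would simply take every reader column; an empty
      // projection (e.g. SELECT COUNT(*)) would take none.
      static ColumnSource resolve(String projected, List<String> readerCols,
                                  List<String> implicitCols) {
        if (implicitCols.contains(projected)) {
          return ColumnSource.IMPLICIT;      // e.g. filename, suffix
        }
        if (readerCols.contains(projected)) {
          return ColumnSource.FROM_READER;   // reader populates its own vector
        }
        return ColumnSource.NULL_FILLED;     // scan supplies the all-null column
      }

      public static void main(String[] args) {
        List<String> readerCols = List.of("a", "b");       // this file's schema
        List<String> implicitCols = List.of("filename");

        // Column "x" is missing from this file; the scan, not the reader,
        // fills it with nulls so the output schema stays consistent.
        System.out.println(resolve("a", readerCols, implicitCols));        // FROM_READER
        System.out.println(resolve("filename", readerCols, implicitCols)); // IMPLICIT
        System.out.println(resolve("x", readerCols, implicitCols));        // NULL_FILLED
      }
    }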
