Cool, I'll do the review.

Kind regards,
Arina
On Fri, Oct 12, 2018 at 9:31 AM Paul Rogers <[email protected]> wrote:

> Hi Arina,
>
> Just did another PR towards completing the scan operator revision to work with the "result set loader." This one is mostly plumbing to implement projection with the scan operator. It generalizes lots of code that already exists into a single, unified mechanism.
>
> Basically, this one takes care of mapping from the data source's schema to the projection list provided by the query (empty, wildcard, or a list of columns). It provides mechanisms for the all-null columns (our famous nullable INT), for the implicit columns, and so on. This particular solution ensures that the data source only worries about populating its own vectors; it does not worry about Drill-specific columns (nulls or implicit), nor does it worry about projection if it is just reading a set of records with a fixed schema.
>
> This PR includes the foundation for file-level schema support. The idea is that the scan operator will ask each reader if it has an up-front schema. Something like Parquet or JDBC can get the schema from the data (or the data source) itself. Something like JSON could get the schema from a schema file, or from information passed along with the reader's physical plan (like what JC did for MsgPack).
>
> The mechanism still allows schemas to be "discovered" on the fly, and has quite a bit of code to handle the many bizarre cases that can occur (and that we've been discussing). This is called "schema smoothing": handling the case where column x appears in, say, file 1, but not in file 2, and then shows up again in file 3.
>
> The next PR will assemble this stuff into a scan framework, after which I can add the three readers: mock, delimited and JSON.
>
> My goal is that, with the scan framework and the CSV and JSON examples, the team can retrofit other readers as the need arises.
>
> The entire mechanism, and the design goals behind it, are documented in [1].
>
> Thanks,
> - Paul
>
> [1] https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades
>
> On Thursday, October 11, 2018, 2:51:22 AM PDT, Arina Yelchiyeva <[email protected]> wrote:
>
> Paul,
>
> sounds good. I like the idea of the mock scanner being done first, since besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(-s), if any, so I can follow them?
>
> Kind regards,
> Arina
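
As a rough illustration of the projection resolution Paul describes above, here is a minimal Java sketch: the reader fills only its own vectors, while the scan framework supplies implicit columns and the null-filled ("nullable INT") columns, which is also what lets schema smoothing keep the output schema stable across files. All names here (Reader, ColumnSource, resolve) are hypothetical and do not correspond to the actual classes in the PR.

    import java.util.List;
    import java.util.Optional;

    public class ProjectionSketch {

      // A reader may (or may not) know its schema before reading any rows.
      // Parquet or JDBC would typically report one; JSON usually would not.
      interface Reader {
        Optional<List<String>> upFrontSchema();
        List<String> discoveredSchema(); // columns actually seen while reading
      }

      enum ColumnSource { FROM_READER, NULL_FILLED, IMPLICIT }

      // Resolve one projected column against what the reader can supply.
      // A wildcard projection would simply take every reader column; an empty
      // projection (e.g. SELECT COUNT(*)) would take none.
      static ColumnSource resolve(String projected, List<String> readerCols,
                                  List<String> implicitCols) {
        if (implicitCols.contains(projected)) {
          return ColumnSource.IMPLICIT;      // e.g. filename, suffix
        }
        if (readerCols.contains(projected)) {
          return ColumnSource.FROM_READER;   // reader populates its own vector
        }
        return ColumnSource.NULL_FILLED;     // scan supplies the all-null column
      }

      public static void main(String[] args) {
        List<String> readerCols = List.of("a", "b");       // this file's schema
        List<String> implicitCols = List.of("filename");

        // Column "x" is missing from this file; the scan, not the reader,
        // fills it with nulls so the output schema stays consistent.
        System.out.println(resolve("a", readerCols, implicitCols));        // FROM_READER
        System.out.println(resolve("filename", readerCols, implicitCols)); // IMPLICIT
        System.out.println(resolve("x", readerCols, implicitCols));        // NULL_FILLED
      }
    }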
