Hi Arina,

Just did another PR toward completing the scan operator revision to work with the "result set loader." This one is mostly plumbing to implement projection with the scan operator. It generalizes lots of code that already exists into a single, unified mechanism.
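To make the projection idea concrete, here is a toy sketch in pseudo-Python of what "mapping the data source's schema to the projection list" means. None of these names exist in Drill; it is only a model of the behavior, not the actual implementation:

```python
# Hypothetical sketch, NOT Drill's actual API: resolve a query's
# projection list against the columns a reader actually provides.

NULL_COL = "all-null (nullable INT)"  # Drill's classic fill-in column

def resolve_projection(projection, table_cols, implicit_cols=()):
    """Map each projected name to its source: the reader's own column,
    a framework-provided implicit column (e.g. filename), or an
    all-null column for names the reader doesn't know about."""
    if projection == "*":            # wildcard: reader's columns as-is
        return [(c, "table") for c in table_cols]
    if not projection:               # empty projection, e.g. COUNT(*)
        return []
    resolved = []
    for name in projection:
        if name in table_cols:
            resolved.append((name, "table"))
        elif name in implicit_cols:  # handled by the framework
            resolved.append((name, "implicit"))
        else:                        # unknown column: the famous null
            resolved.append((name, NULL_COL))
    return resolved
```

The point of the design is visible in the third branch: the reader only ever fills its own "table" columns; the scan framework supplies the implicit and all-null columns around it.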
Basically, this one takes care of mapping from the data source's schema to the projection list provided by the query (empty, wildcard or list of columns). It provides mechanisms for the all-null columns (our famous nullable INT), for the implicit columns and so on. This solution ensures that the data source worries only about populating its own vectors; it does not worry about Drill-specific columns (nulls or implicit), nor about projection if it is just reading a set of records with a fixed schema.

This PR also includes the foundation for file-level schema support. The idea is that the scan operator will ask each reader whether it has an up-front schema. Something like Parquet or JDBC can get the schema from the data (or the data source) itself. Something like JSON could get the schema from a schema file, or from information passed along with the reader's physical plan (like what JC did for MsgPack).

The mechanism still allows schemas to be "discovered" on the fly, and has quite a bit of code to handle the many bizarre cases that can occur (and that we've been discussing). This is called "schema smoothing": it tries to handle the case in which column x appears in, say, file 1, disappears in file 2, then shows up again in file 3.

The next PR will assemble this stuff into a scan framework, after which I can add the three readers: mock, delimited and JSON. My goal is that, with the scan framework and the CSV and JSON examples, the team can retrofit other readers as the need arises.

The entire mechanism, and the design goals behind it, are documented in [1].

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades

On Thursday, October 11, 2018, 2:51:22 AM PDT, Arina Yelchiyeva <[email protected]> wrote:

Paul, sounds good. I like the idea of the mock scanner being done first, since besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(s), if any, so I can follow them?

Kind regards,
Arina
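P.S. For anyone skimming the design doc, the "schema smoothing" behavior described above can be sketched as a toy model (assumed semantics only, not the actual Drill code; the function and names are made up for illustration):

```python
# Toy model of "schema smoothing": when a column seen in an earlier
# file is missing from a later one, keep presenting it (as all nulls,
# with its remembered type) so downstream operators see a stable
# schema instead of a schema-change event.

def smooth_schemas(file_schemas):
    """file_schemas: list of {column: type} dicts, one per file read
    in order. Returns the schema presented to downstream operators
    for each file after smoothing."""
    remembered = {}   # column types learned from earlier files
    presented = []
    for schema in file_schemas:
        remembered.update(schema)      # learn or refresh column types
        # present every column seen so far; columns missing from this
        # file ride along as all-null columns of their remembered type
        presented.append(dict(remembered))
    return presented
```

In the x-in-file-1, gone-in-file-2, back-in-file-3 case, the presented schema stays identical across all three files, so no spurious schema change fires. (The real cases are hairier, e.g. type conflicts that genuinely force a schema change; see the wiki page in [1].)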
