Hi Arina,

Just did another PR toward completing the scan operator revision to work with the "result set loader." This one is mostly plumbing to implement projection with the scan operator. It generalizes lots of code that already exists into a single, unified mechanism.
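To make the projection idea concrete, here is a toy sketch in pseudo-Python of what "mapping the data source's schema to the projection list" means. None of these names exist in Drill; it is only a model of the behavior, not the actual implementation:

```python
# Hypothetical sketch, NOT Drill's actual API: resolve a query's
# projection list against the columns a reader actually provides.

NULL_COL = "all-null (nullable INT)"  # Drill's classic fill-in column

def resolve_projection(projection, table_cols, implicit_cols=()):
    """Map each projected name to its source: the reader's own column,
    a framework-provided implicit column (e.g. filename), or an
    all-null column for names the reader doesn't know about."""
    if projection == "*":            # wildcard: reader's columns as-is
        return [(c, "table") for c in table_cols]
    if not projection:               # empty projection, e.g. COUNT(*)
        return []
    resolved = []
    for name in projection:
        if name in table_cols:
            resolved.append((name, "table"))
        elif name in implicit_cols:  # handled by the framework
            resolved.append((name, "implicit"))
        else:                        # unknown column: the famous null
            resolved.append((name, NULL_COL))
    return resolved
```

The point of the design is visible in the third branch: the reader only ever fills its own "table" columns; the scan framework supplies the implicit and all-null columns around it.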
Basically, this one takes care of mapping from the data source's schema to the projection list provided by the query (empty, wildcard or list of columns). It provides mechanisms for the all-null columns (our famous nullable INT), for the implicit columns and so on. This solution ensures that the data source worries only about populating its own vectors; it does not worry about Drill-specific columns (nulls or implicit), nor about projection if it is just reading a set of records with a fixed schema.

This PR also includes the foundation for file-level schema support. The idea is that the scan operator will ask each reader whether it has an up-front schema. Something like Parquet or JDBC can get the schema from the data (or the data source) itself. Something like JSON could get the schema from a schema file, or from information passed along with the reader's physical plan (like what JC did for MsgPack).

The mechanism still allows schemas to be "discovered" on the fly, and has quite a bit of code to handle the many bizarre cases that can occur (and that we've been discussing). This is called "schema smoothing": it tries to handle the case in which column x appears in, say, file 1, disappears in file 2, then shows up again in file 3.

The next PR will assemble this stuff into a scan framework, after which I can add the three readers: mock, delimited and JSON. My goal is that, with the scan framework and the CSV and JSON examples, the team can retrofit other readers as the need arises.

The entire mechanism, and the design goals behind it, are documented in [1].

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades

On Thursday, October 11, 2018, 2:51:22 AM PDT, Arina Yelchiyeva <[email protected]> wrote:

Paul, sounds good. I like the idea of the mock scanner being done first, since besides CSV and JSON, other readers would have to be updated as well. Could you please share the Jira number(s), if any, so I can follow them?

Kind regards,
Arina
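P.S. For anyone skimming the design doc, the "schema smoothing" behavior described above can be sketched as a toy model (assumed semantics only, not the actual Drill code; the function and names are made up for illustration):

```python
# Toy model of "schema smoothing": when a column seen in an earlier
# file is missing from a later one, keep presenting it (as all nulls,
# with its remembered type) so downstream operators see a stable
# schema instead of a schema-change event.

def smooth_schemas(file_schemas):
    """file_schemas: list of {column: type} dicts, one per file read
    in order. Returns the schema presented to downstream operators
    for each file after smoothing."""
    remembered = {}   # column types learned from earlier files
    presented = []
    for schema in file_schemas:
        remembered.update(schema)      # learn or refresh column types
        # present every column seen so far; columns missing from this
        # file ride along as all-null columns of their remembered type
        presented.append(dict(remembered))
    return presented
```

In the x-in-file-1, gone-in-file-2, back-in-file-3 case, the presented schema stays identical across all three files, so no spurious schema change fires. (The real cases are hairier, e.g. type conflicts that genuinely force a schema change; see the wiki page in [1].)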
