paul-rogers opened a new pull request #1993: DRILL-7601: Shift column 
conversion to reader from scan framework
URL: https://github.com/apache/drill/pull/1993
 
 
   ## Description
   
   Moves scan operator type conversion code into readers and out of the scan 
framework.
   
   At the time we implemented provided schemas with the text reader, the best path forward appeared to be to perform column type conversions within the scan framework, including deep in the column writer structure.
   
   Experience with other readers has shown that the text reader is a special case: it always writes strings, which Drill-provided converters can parse into other types. Other readers, however, are not so simple: they often have their own source structures which must be mated to a column writer, and so conversion is generally best done in the reader, where it can be specific to the nuances of each reader.
   
   Since conversion is reader-specific, and uses types known only to that one 
reader, it cannot be generic in the scan framework.
   
   This PR is part of a [larger 
project](https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model)
 to implement an overall design for combining projection, provided schemas and 
reader schemas into an overall Drill schema design.
   
   A side benefit is that the column writers become simpler without the 
scan-specific conversion code. This helps us to use the column writers in other 
operators in future work.
   
   ### Schema Handling in the Scan Operator
   
   The scan schema mechanism works as follows:
   
   * The execution plan provides the projection list. Prior PRs parsed that 
list into a set of column-like structures.
   * A scan consists of a set of readers. Each reader works with an *input 
source*.
   * Each input source defines an *input schema* in some source-specific format 
(such as a JDBC format, a list of CSV columns, etc.)
   * The reader converts the source schema to a Drill format *reader schema*, 
converting  from source to Drill types as needed.
   * EVF uses the projection list to decide which reader columns to project (that is, to create a vector to store data for) and which not to project (which means creating a dummy column writer).
   * The scan framework notices which projected columns are neither implicit columns nor provided in the reader schema, and creates "null" columns as placeholders. (This is where things often go off the rails, since the scan has no idea what type to use for the null column.)
   * The scan combines the reader schema, implicit columns, and the null columns to produce the scan's *output schema*, which is then consumed by the downstream operator.
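   The pipeline above can be sketched roughly as follows; all names here (`Column`, `toReaderSchema`, `outputSchema`, the type strings) are hypothetical stand-ins for illustration, not actual EVF classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the scan schema pipeline; not real EVF code.
public class SchemaPipelineSketch {

  record Column(String name, String type) {}

  // Step 1: the reader converts its source schema to a Drill-typed reader schema.
  static List<Column> toReaderSchema(Map<String, String> sourceSchema) {
    List<Column> readerSchema = new ArrayList<>();
    sourceSchema.forEach((name, sourceType) ->
        readerSchema.add(new Column(name, mapType(sourceType))));
    return readerSchema;
  }

  static String mapType(String sourceType) {
    switch (sourceType) {
      case "csv-text": return "VARCHAR";
      case "jdbc-int": return "INT";
      default:         return "VARCHAR";
    }
  }

  // Step 2: the scan resolves the projection list against the reader schema;
  // projected columns the reader cannot supply become "null" columns whose
  // type is a guess -- exactly the problem a provided schema solves.
  static List<Column> outputSchema(List<String> projection, List<Column> readerSchema) {
    List<Column> output = new ArrayList<>();
    for (String name : projection) {
      output.add(readerSchema.stream()
          .filter(col -> col.name().equals(name))
          .findFirst()
          .orElse(new Column(name, "INT-OPTIONAL"))); // arbitrary guess
    }
    return output;
  }
}
```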
   
   Drill also supports a provided schema:
   
   * The execution plan can optionally include a *provided schema*. When 
present, this schema gives the name (case) and type to be used for each column. 
That is, the provided schema states the type of vectors to be produced.
   * If the provided schema is *strict*, it acts as a projection filter: any reader column not in the provided schema is treated as unprojected.
   * Data produced by the reader is converted from the type the reader produces 
to the type of the provided schema column. (This is the gist of the change in 
this PR, see below.)
   * When computing null columns, the scan operator can avoid type conflicts by 
using the  column type from the provided schema.
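   The strict-schema rule reduces to a one-line check; `providedCols` here is just a hypothetical set of provided column names, not the real Drill metadata class:

```java
import java.util.Set;

public class StrictSchemaSketch {

  // A strict provided schema acts as a projection filter: reader columns
  // absent from it are treated as unprojected.
  static boolean isProjected(Set<String> providedCols, boolean strict, String readerCol) {
    return !strict || providedCols.contains(readerCol);
  }
}
```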
   
   ### Scan Schema Column Adapters
   
   This PR shifts the conversion step. Before this PR, conversion happened deep inside EVF. With this PR, conversion happens in the reader as part of the source-schema to reader-schema conversion.
   
   The idea is that each reader will use some form of *column adapter* (what 
we've sometimes called a *column shim*) to convert from source-specific form to 
Drill form. The recently-revised Avro format plugin is a great example.
   
   A column adapter conceptually has three parts:
   
   * A "front end" that obtains (or accepts) data in some source-specific way.
   * A conversion step that converts data from the source format to a 
Drill-compatible type.
   * A "back end" that writes the data to a vector using a Drill column writer.
   
   As it turns out, the first two steps are unique to each reader, only the 
back end is common.
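   As a rough sketch of how the three parts line up (`ScalarWriter` here is a simplified stand-in for Drill's column writer, and `CsvIntAdapter` is invented for illustration):

```java
// Hypothetical shape of a column adapter, following the three parts above.
public class ColumnAdapterSketch {

  // The common "back end": a stand-in for Drill's column writer.
  interface ScalarWriter { void setInt(int value); }

  static class CsvIntAdapter {
    private final ScalarWriter writer;

    CsvIntAdapter(ScalarWriter writer) { this.writer = writer; }

    // "Front end": accepts the source-specific form (a CSV field as text).
    void load(String csvField) {
      // Conversion step: parse the source format into a Drill-compatible type.
      int value = Integer.parseInt(csvField.trim());
      // "Back end": write through the common column writer.
      writer.setInt(value);
    }
  }
}
```

   Only the `ScalarWriter` call at the end is shared across readers; the front end and the parse step would differ per reader.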
   
   In some cases, such as CSV, the "front end" can work with typical Java types (`int`, `String`, etc.). In this case, we want to write to each column via a single method call. To do this, this PR adds a `ValueWriter` interface which the `ScalarWriter` interface extends. The reader can create its own adapters by implementing the `ValueWriter` interface, so the reader can freely mix "plain" writers and column converters.
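   A minimal sketch of that idea (the interface is reduced to a single method, and both writer classes are invented for illustration; the real `ValueWriter` covers more types):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the reader drives every column through one narrow interface,
// so plain writers and converters mix freely in the same row loop.
public class ValueWriterSketch {

  interface ValueWriter { void setString(String value); }

  // A "plain" writer stores the string as-is.
  static class PlainWriter implements ValueWriter {
    final List<String> column = new ArrayList<>();
    public void setString(String value) { column.add(value); }
  }

  // A reader-built converter parses the text before storing it.
  static class IntConverter implements ValueWriter {
    final List<Integer> column = new ArrayList<>();
    public void setString(String value) { column.add(Integer.parseInt(value)); }
  }

  // Reader code path: one loop, no knowledge of which writer converts.
  static void loadRow(ValueWriter[] writers, String[] fields) {
    for (int i = 0; i < writers.length; i++) {
      writers[i].setString(fields[i]);
    }
  }
}
```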
   
   An obvious special case is when the reader, such as CSV, wants to use a standard set of conversions. Drill already provided such conversions; they now move from the accessor package into the scan operator package because they are specific to just some readers; no other operator needs such functionality. The standard type conversions extend a new `DirectConversion` class, which implements `ValueWriter`.
   
   ### Text Reader
   
   Modified the text reader to perform type conversions driven by the provided schema, using the standard conversions described above. Added tests to verify operation in both the "with headers" and "without headers" cases.
   
   Note that, with a provided schema, the "without headers" case will *not* 
produce the single `columns` column; it will instead produce the set of columns 
listed in the provided schema. The provided schema must list the first *n* columns with no holes. However, fields at the end of the record can be left off of the provided schema; these fields are ignored.
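   The positional rule can be sketched as follows (a hypothetical helper, not the text reader's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: "without headers" plus a provided schema maps fields positionally
// to the first n provided columns; trailing extra fields are ignored.
public class HeaderlessMappingSketch {

  static Map<String, String> mapRow(String[] providedCols, String[] fields) {
    Map<String, String> row = new LinkedHashMap<>();
    int n = Math.min(providedCols.length, fields.length);
    for (int i = 0; i < n; i++) {
      row.put(providedCols[i], fields[i]);
    }
    return row; // fields beyond providedCols.length are dropped
  }
}
```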
   
   The `TextParsingSettings` class and the text parser have long supported an option to trim leading and trailing white space, but the option was hard-coded off. Added a provided-schema property to allow enabling this feature for all columns in a table, or for specific columns.
   
   Also exposed the "parse unescaped quotes" option the same way.
   
   ### Other Readers
   
   Updated the Log, Avro and HDF5 plugins to insert conversions where needed.
   
   Restructured the HDF5 column adapters a bit to simplify the code.
   
   ### Other Changes
   
   * Large amount of code cleanup including standardizing names.
   * Restructure the type conversion classes to work on top of, rather than 
inside, column converters.
   * Remove all the "plumbing" which passed the old converter factory down 
through the row set, result set loader and column writer classes.
   * Move conversion-related properties from the conversion class to the 
metadata classes.
   * Adjust the `StandardConversions` class to work with a provided column 
schema as the conversion target.
   * Added the provided schema to the schema negotiator given to each reader. This is necessary because the reader now does conversion and so needs the provided schema, if available.
   * Removed the `ProjectionSet` class, which attempted to combine projection, provided schema and type conversion into a single concept.
   * Added a replacement `ProjectionFilter` which handles only projection, based on the projection list and, optionally, the provided schema.
   * Modified the Avro `ColumnConvertersUtil` to create a new 
`ColumnConverterFactory` which integrates standard type conversions based on 
the provided schema into the Avro-to-Drill conversions.
   
   ### Future Steps
   
   This is a first step. A future PR will improve schema handling in the scan framework to more clearly implement the schema "pipeline" outlined above and in the referenced design. As a result, some of the schema handling code in this PR is a bit ad hoc, to keep this PR from growing even larger.
   
   ## Documentation
   
   No user-visible changes except the additions to the text reader's provided-schema properties. The documentation should be updated to clearly explain these properties (along with the two new ones added here).
   
   ## Testing
   
   Removed obsolete tests. Added several new tests. Reran the entire unit test 
suite. Fixed issues in several readers (see above) resulting from the changes 
in this PR.
