[GitHub] okalinin opened a new pull request #1370: DRILL-5797: Use Parquet new reader more often

GitBox Tue, 10 Jul 2018 01:44:11 -0700

okalinin opened a new pull request #1370: DRILL-5797: Use Parquet new reader 
more often
URL: https://github.com/apache/drill/pull/1370
 
 
   # DRILL-5797: use Parquet new reader more often
   ## Background
   This PR follows up on previous work done by @dprofeta and documented in 
the JIRA. Previously, the new reader was used only if the file schema did not 
contain any complex columns. With this change, the new reader is also used on a 
complex schema when the queried column list does not contain any complex 
column, which should make new reader usage more frequent.
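
   The difference between the old and new qualification rules can be sketched 
as below. This is only an illustration of the rule as described in this PR, not 
the actual Drill code: the class `ReaderChoiceSketch`, both method names, and 
the representation of columns as dotted strings are assumptions. It also 
assumes a queried column counts as complex when its top level field is complex, 
and it does not model star queries.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: class, method, and parameter names are hypothetical,
// not Drill's actual API.
final class ReaderChoiceSketch {

  // Old rule as described: any complex column in the file schema rules out the new reader.
  static boolean oldRule(Set<String> complexTopLevelColumns) {
    return complexTopLevelColumns.isEmpty();
  }

  // New rule as described: only the queried columns matter.
  // Assumption: a queried path is complex if its top level segment is a complex column.
  static boolean newRule(List<String> queriedPaths, Set<String> complexTopLevelColumns) {
    for (String path : queriedPaths) {
      int dot = path.indexOf('.');
      String root = dot < 0 ? path : path.substring(0, dot);
      if (complexTopLevelColumns.contains(root)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> complex = Set.of("b");                    // schema: a (simple), b (complex)
    System.out.println(oldRule(complex));                 // false
    System.out.println(newRule(List.of("a"), complex));   // true
    System.out.println(newRule(List.of("b.a"), complex)); // false
  }
}
```

   With a schema holding simple column `a` and complex column `b`, a query on 
`a` alone qualifies under the new rule but not under the old one.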
   
   ## Change description
   To make the new reader usable on a complex schema, the following 
modifications had to be made:
   
   - `ParquetReaderUtility` class - modified and added several functions so 
that it works with nested schemas. For example, one limitation was the explicit 
reference to the top level schema element path via `column.getPath()[0]` in 
several locations. The top level schema element path was also used when 
building the path to `SchemaElement` map, which corrupted the map when the 
schema contained columns `a` and `b.a` (key `a` was used for both schema 
elements, so one map entry overwrote the other).
   - `ParquetSchema` - `fieldSelected()` function replaced with 
`columnSelected()` so that it works with full paths. Previously, it would fail 
when the schema contained columns `a` and `b.a`, as both schema paths would be 
classified as selected.
   - `ParquetColumnMetadata` - replaced the top level path reference with the 
full path; also changed the parameter passed to 
`ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to 
`column.getTypeLength()`. The reason is that `se.getType_length()` returned 0 
on a FIXED_LEN_BYTE_ARRAY column, and the resulting failure in minor type 
conversion was failing complex parquet tests. `column.getTypeLength()` provides 
the correct result. I am not sure whether this is a Parquet bug - possibly a 
TBD item.
   - `AbstractParquetScanBatchCreator` - added a function which uses 
`ParquetReaderUtility` functions to identify whether the query column list 
contains complex columns and thus whether the query qualifies for the new 
reader.
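
   The map corruption mentioned for `ParquetReaderUtility` can be reproduced in 
a minimal sketch. This is not the Drill implementation: `SchemaKeySketch`, 
`keyByName()` and `keyByFullPath()` are hypothetical names, and columns are 
modelled as dotted strings instead of `SchemaElement` objects. It only shows 
why a key that is not the full path collides for columns `a` and `b.a`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not Drill's actual API.
final class SchemaKeySketch {

  // Keys each column by its own name (last path segment),
  // so "a" and "b.a" both end up under key "a".
  static Map<String, String> keyByName(List<String> fullPaths) {
    Map<String, String> map = new HashMap<>();
    for (String path : fullPaths) {
      String name = path.substring(path.lastIndexOf('.') + 1);
      map.put(name, path); // a later column silently replaces an earlier one
    }
    return map;
  }

  // Keys each column by its full dotted path, which is unique within a schema.
  static Map<String, String> keyByFullPath(List<String> fullPaths) {
    Map<String, String> map = new HashMap<>();
    for (String path : fullPaths) {
      map.put(path, path);
    }
    return map;
  }

  public static void main(String[] args) {
    List<String> columns = List.of("a", "b.a");
    System.out.println(keyByName(columns).size());     // 1: one entry was overwritten
    System.out.println(keyByFullPath(columns).size()); // 2: both columns kept
  }
}
```

   The same reasoning applies to column selection: matching on anything less 
than the full path cannot tell `a` and `b.a` apart.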
   
   Added tests rely on the existing `complex.parquet` file used in other tests.
   
   ## Level of testing
   Build tests and complex*q query tests from the Drill test framework. Tests 
were added for newly introduced methods except for 
`ParquetReaderUtility.buildFullColumnPath()`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
