[GitHub] okalinin opened a new pull request #1368: DRILL-5797: use Parquet new reader more often

GitBox Mon, 09 Jul 2018 09:35:45 -0700

okalinin opened a new pull request #1368: DRILL-5797: use Parquet new reader 
more often
URL: https://github.com/apache/drill/pull/1368
 
 
   # DRILL-5797: use Parquet new reader more often
   
   ## Background
   This PR follows up on previous work done by @dprofeta and documented in 
the JIRA. Previously, the new reader was used only if the file schema contained 
no complex columns. With this change, the new reader is also used on a complex 
schema when the queried column list contains no complex column, which should 
make the new reader be used more often.
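
   The rule change above can be illustrated with a minimal sketch. It is not 
the code from this PR: the class `ReaderChoiceSketch`, its method names, and 
the representation of a schema as a map from top-level field name to an 
"is complex" flag are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Drill.
class ReaderChoiceSketch {

  /** True if any queried column resolves to a top-level field marked complex. */
  static boolean containsComplexColumn(Map<String, Boolean> complexByTopLevelName,
                                       List<String> queriedColumns) {
    for (String column : queriedColumns) {
      // The root segment of a dotted path such as "b.a" is "b".
      String root = column.split("\\.")[0];
      if (complexByTopLevelName.getOrDefault(root, false)) {
        return true;
      }
    }
    return false;
  }

  /** Old rule as described: any complex column in the file schema rules out the new reader. */
  static boolean useNewReaderBefore(Map<String, Boolean> complexByTopLevelName) {
    return !complexByTopLevelName.containsValue(true);
  }

  /** New rule as described: only the queried columns matter. */
  static boolean useNewReaderAfter(Map<String, Boolean> complexByTopLevelName,
                                   List<String> queriedColumns) {
    return !containsComplexColumn(complexByTopLevelName, queriedColumns);
  }

  public static void main(String[] args) {
    // Assumed example schema: primitive column `a`, complex column `b`.
    Map<String, Boolean> schema = new LinkedHashMap<>();
    schema.put("a", false);
    schema.put("b", true);

    System.out.println(useNewReaderBefore(schema));
    System.out.println(useNewReaderAfter(schema, Arrays.asList("a")));
    System.out.println(useNewReaderAfter(schema, Arrays.asList("b.a")));
  }
}
```

   In this sketch, a query that touches only `a` qualifies for the new reader 
under the new rule even though the file schema also contains the complex 
column `b`, while a query on `b.a` does not.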
   
   ## Change description
   To make the new reader usable on a complex schema, the following 
modifications had to be made:
   - `ParquetReaderUtility` class - modified and added several functions so 
that it works with a nested schema. For example, one limitation was that 
several locations explicitly referenced the top-level schema element path with 
`column.getPath()[0]`. The top-level schema element path was also used when 
building the path-to-`SchemaElement` map, which corrupted the map when the 
schema contained columns `a` and `b`.`a` (key `a` was used for both schema 
elements, so one map entry overwrote the other).
   - `ParquetSchema` - `fieldSelected` function replaced with `columnSelected` 
so that it works with the full path. Previously, it would fail when the schema 
contained columns `a` and `b`.`a`, as both schema paths would be marked as 
selected.
   - `ParquetColumnMetadata` - replaced the top-level path reference with the 
full path; also changed the parameter passed to 
`ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to 
`column.getTypeLength()`. The reason is that `se.getType_length()` returned 0 
for a FIXED_LEN_BYTE_ARRAY column, and the resulting minor type conversion 
failure was breaking the complex Parquet tests. `column.getTypeLength()` 
returns the correct result. I am not sure whether this is a Parquet bug; 
possibly a TBD item.
   - `AbstractParquetScanBatchCreator` - added a function that uses 
`ParquetReaderUtility` functions to determine whether the query column list 
contains a complex column.
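
   The key collision described for the path-to-`SchemaElement` map can be 
reproduced with a small sketch. It assumes the schema arrives as a depth-first 
flattened list in which each element carries a child count, which is how the 
Parquet footer stores it. The `Element` class and both methods are hypothetical 
stand-ins, not Drill or Parquet APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Drill or Parquet.
class SchemaPathKeySketch {

  /** Stand-in for one entry of a flattened, depth-first schema list. */
  static final class Element {
    final String name;
    final int numChildren;

    Element(String name, int numChildren) {
      this.name = name;
      this.numChildren = numChildren;
    }
  }

  /** Keys each element by its bare name, so `a` and `b`.`a` share the key "a". */
  static Map<String, Element> keyByName(List<Element> elements) {
    Map<String, Element> map = new HashMap<>();
    for (Element e : elements) {
      map.put(e.name, e); // a later element with the same name overwrites the earlier one
    }
    return map;
  }

  /** Keys each element by its full dotted path, e.g. "a", "b", "b.a". */
  static Map<String, Element> keyByFullPath(List<Element> elements) {
    Map<String, Element> map = new LinkedHashMap<>();
    int[] pos = {0}; // shared cursor into the flattened list
    while (pos[0] < elements.size()) {
      walk(elements, pos, "", map);
    }
    return map;
  }

  /** Consumes one element and, recursively, its declared number of children. */
  private static void walk(List<Element> elements, int[] pos, String prefix,
                           Map<String, Element> map) {
    Element e = elements.get(pos[0]++);
    String path = prefix.isEmpty() ? e.name : prefix + "." + e.name;
    map.put(path, e);
    for (int i = 0; i < e.numChildren; i++) {
      walk(elements, pos, path, map);
    }
  }

  public static void main(String[] args) {
    // Assumed schema: column `a`, and group `b` with one child `a`.
    List<Element> elements = Arrays.asList(
        new Element("a", 0), new Element("b", 1), new Element("a", 0));

    System.out.println(keyByName(elements).keySet());
    System.out.println(keyByFullPath(elements).keySet());
  }
}
```

   With name-only keys, the three schema elements collapse into two map 
entries; with full-path keys, each element keeps its own entry.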
   
   The added tests rely on the existing `complex.parquet` file used in other 
tests.
   
   ## Level of testing
   Build tests and complex*q query tests from the Drill test framework. Tests 
were added for the newly introduced methods except for 
`ParquetReaderUtility.buildFullColumnPath()`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
