Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/976#discussion_r144074677
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
---
@@ -156,20 +161,46 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
}
- private static boolean isComplex(ParquetMetadata footer) {
- MessageType schema = footer.getFileMetaData().getSchema();
-
- for (Type type : schema.getFields()) {
- if (!type.isPrimitive()) {
- return true;
+ private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+ /*
+ ParquetRecordReader cannot read nested columns and cannot handle repeated columns;
+ it handles only flat and optional columns.
+ For a wildcard query, we check every column in the metadata.
+ Otherwise, we check only the projected columns.
+ */
--- End diff --
Very small request: this is a great Javadoc comment, so please use this form:
```
/**
* Your comment here.
*/
```
It may also be worth pointing out that the algorithm here works regardless of the form of the column:
* `a`: We must consider the type of column `a` in Parquet.
* `a.b`: The top-level column `a` must be a structure in Parquet. (If not, then presumably an error is thrown later on.) So there is no need to check `b`.
* `a[10]`: The column `a` must be an array (repeated), so there is no need to check anything beyond the root of the `SchemaPath`. Again, presumably, Drill will throw an error internally if it turns out that `a` is not an array.
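To illustrate, here is a minimal sketch of that root-segment idea in plain Java. It is not Drill code: the `Map<String, Boolean>` stands in for the Parquet `MessageType` (mapping each top-level field name to whether it is a group or repeated type), a `null` projection list stands in for a wildcard query, and the string split stands in for `SchemaPath.getRootSegment()`.

```java
import java.util.List;
import java.util.Map;

public class ComplexCheckSketch {

  /**
   * Returns true if any column the query touches is complex.
   *
   * @param topLevelIsComplex hypothetical stand-in for the Parquet schema:
   *                          top-level field name -> is it a group/repeated type?
   * @param projectedPaths    projected column paths such as "a", "a.b", "a[10]";
   *                          null means a wildcard query
   */
  static boolean isComplex(Map<String, Boolean> topLevelIsComplex,
                           List<String> projectedPaths) {
    if (projectedPaths == null) {
      // Wildcard query: every top-level column in the metadata must be checked.
      return topLevelIsComplex.containsValue(Boolean.TRUE);
    }
    for (String path : projectedPaths) {
      // Only the root segment matters: for "a.b" or "a[10]" the parent "a"
      // must already be a structure or array, so checking "a" suffices.
      String root = path.split("[.\\[]")[0];
      if (Boolean.TRUE.equals(topLevelIsComplex.get(root))) {
        return true;
      }
    }
    return false;
  }
}
```

The point of the sketch is only that no path segment past the root ever needs to be inspected; mismatches deeper in the path surface as errors later, as noted above.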
---