[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

paul-rogers Fri, 06 Oct 2017 11:31:00 -0700

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143264522
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, 
List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) 
{
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath 
column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    --- End diff --
    
    What are the semantics of `getType()`? Does it return null if there is no 
such type? If so, then we can retrieve the type and check if it is null. 
Otherwise, if it throws an exception, perhaps we can catch that. Either way, we 
avoid two searches for the same column and two conversions of the path to lower 
case.
    
    Note also that a recent change deprecated `getPath()`. The preferred form 
is `getName()` so it is clear that we are getting a single name part. If the 
Parquet column is nested (a.b.c, say), then we have to explicitly handle the 
multiple name parts as Drill does support names with dots and one cannot simply 
concatenate or split a path to get a string. That is, `["a.b", "c"]` is two 
fields, `["a", "b", "c"]` is three, but both are represented as a full path as 
"a.b.c", creating an ambiguity.

---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Reply via email to