[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194996#comment-16194996
 ] 

ASF GitHub Bot commented on DRILL-5797:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143264522
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, 
List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) 
{
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath 
column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    --- End diff --
    
    What are the semantics of `getType()`? Does it return null if there is no 
such type? If so, then we can retrieve the type and check if it is null. 
Otherwise, if it throws an exception, perhaps we can catch that. Either way, we 
avoid two searches for the same column and two conversions of the path to lower 
case.
    
    Note also that a recent change deprecated `getPath()`. The preferred form 
is `getName()` so it is clear that we are getting a single name part. If the 
Parquet column is nested (a.b.c, say), then we have to explicitly handle the 
multiple name parts as Drill does support names with dots and one cannot simply 
concatenate or split a path to get a string. That is, `["a.b", "c"]` is two 
fields, `["a", "b", "c"]` is three, but both are represented as a full path as 
"a.b.c", creating an ambiguity.


> Use more often the new parquet reader
> -------------------------------------
>
>                 Key: DRILL-5797
>                 URL: https://issues.apache.org/jira/browse/DRILL-5797
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>             Fix For: 1.12.0
>
>
> The choice of using the regular parquet reader of the optimized one is based 
> of what type of columns is in the file. But the columns that are read by the 
> query doesn't matter. We can increase a little bit the cases where the 
> optimized reader is used by checking is the projected column are simple or 
> not.
> This is an optimization waiting for the fast parquet reader to handle complex 
> structure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to