[ 
https://issues.apache.org/jira/browse/ORC-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Sharma updated ORC-323:
------------------------------
    Description: 
*1. Predicate Pushdown For Nested field*

*1.1 Objective*

In ORC (Optimized Row Columnar), every primitive-type column has an index. 
A predicate refers to a column name in the WHERE clause, and pushdown means 
skipping row groups, stripes, and blocks while reading, by comparing the 
predicate against the metadata stored in the stripes. The metadata consists 
of the min, max, and sum values present in the given column. 

Currently, predicate pushdown works only for top-level columns of the schema. 

This issue proposes extending predicate pushdown to nested structures in Hive.  
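
To make the skipping rule concrete, here is a minimal sketch of how min/max statistics can rule out a row group for the search argument shown in 1.2.2. ColumnStats and RowGroupPruner are hypothetical names introduced for illustration, not ORC classes, and the sample statistics are invented. The sketch also ignores nulls, which the real reader has to account for.

```java
// Hypothetical, simplified per-row-group statistics for one long column.
final class ColumnStats {
    final long min;
    final long max;

    ColumnStats(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

final class RowGroupPruner {
    /** True when no value in [min, max] can satisfy "column < literal". */
    static boolean canSkipLessThan(ColumnStats stats, long literal) {
        return stats.min >= literal;
    }

    /** True when no value in [min, max] can satisfy "NOT (column < literal)". */
    static boolean canSkipNotLessThan(ColumnStats stats, long literal) {
        return stats.max < literal;
    }

    /** NOT (col < 300000) AND col < 600000: skip if either conjunct is impossible. */
    static boolean canSkip(ColumnStats stats) {
        return canSkipNotLessThan(stats, 300000L) || canSkipLessThan(stats, 600000L);
    }

    public static void main(String[] args) {
        // Invented statistics for three row groups.
        ColumnStats[] groups = {
            new ColumnStats(0L, 250000L),
            new ColumnStats(300000L, 550000L),
            new ColumnStats(600000L, 900000L)
        };
        for (int i = 0; i < groups.length; i++) {
            System.out.println("row group " + i + (canSkip(groups[i]) ? ": skip" : ": read"));
        }
    }
}
```

With these invented statistics only the middle row group has to be read; the other two cannot contain a matching value.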


*1.2 Current state*
 
*1.2.1 Schema*
struct<int1:int, complex:struct<int2:int,String1:string>>
 
*1.2.2 Search Argument*
SearchArgument sarg = SearchArgumentFactory.newBuilder()
       .startAnd()
       .startNot()
       .lessThan("int2", PredicateLeaf.Type.LONG, 300000L)
       .end()
       .lessThan("int2", PredicateLeaf.Type.LONG, 600000L)
       .end()
       .build();
 
 
 
 
*1.2.3 Predicate pushdown not supported for nested fields in ORC*
 
private boolean[] populatePpdSafeConversion() {
    if (fileSchema == null || readerSchema == null || readerFileTypes == null) {
      return null;
    }

    boolean[] result = new boolean[readerSchema.getMaximumId() + 1];
    boolean safePpd = validatePPDConversion(fileSchema, readerSchema);
    result[readerSchema.getId()] = safePpd;
    List<TypeDescription> children = readerSchema.getChildren();
    if (children != null) {
      for (TypeDescription child : children) {
        TypeDescription fileType = getFileType(child.getId());
        safePpd = validatePPDConversion(fileType, child);
        result[child.getId()] = safePpd;
      }
    }
    return result;
  }

populatePpdSafeConversion() checks conversion validity only for the root and 
its direct children, i.e. the top-level fields. The entry for a nested field 
is never set, so validation of a search argument on a nested field fails.
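
One way to cover nested fields is to walk the whole reader schema recursively instead of stopping at the direct children. The following is a sketch under stated assumptions, not the change made in the pull request: SchemaNode is a hypothetical stand-in for TypeDescription, isSafe() is a placeholder for validatePPDConversion(), and children are matched by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a schema node; not ORC's TypeDescription.
final class SchemaNode {
    final int id;
    final String category;
    final List<SchemaNode> children = new ArrayList<SchemaNode>();

    SchemaNode(int id, String category) {
        this.id = id;
        this.category = category;
    }

    SchemaNode add(SchemaNode child) {
        children.add(child);
        return this;
    }

    // Largest column id found in this subtree.
    int maximumId() {
        int max = id;
        for (SchemaNode child : children) {
            max = Math.max(max, child.maximumId());
        }
        return max;
    }
}

final class PpdSafeConversionSketch {
    // Placeholder rule (assumption): a conversion is safe only if the categories match.
    static boolean isSafe(SchemaNode file, SchemaNode reader) {
        return file != null && file.category.equals(reader.category);
    }

    static boolean[] populate(SchemaNode file, SchemaNode reader) {
        boolean[] result = new boolean[reader.maximumId() + 1];
        visit(file, reader, result);
        return result;
    }

    // Records the result for this node, then descends into every child.
    private static void visit(SchemaNode file, SchemaNode reader, boolean[] result) {
        result[reader.id] = isSafe(file, reader);
        for (int i = 0; i < reader.children.size(); i++) {
            SchemaNode fileChild =
                (file != null && i < file.children.size()) ? file.children.get(i) : null;
            visit(fileChild, reader.children.get(i), result);
        }
    }

    public static void main(String[] args) {
        // struct<int1:int, complex:struct<int2:int,String1:string>>
        SchemaNode schema = new SchemaNode(0, "struct")
            .add(new SchemaNode(1, "int"))
            .add(new SchemaNode(2, "struct")
                .add(new SchemaNode(3, "int"))
                .add(new SchemaNode(4, "string")));
        System.out.println(Arrays.toString(populate(schema, schema)));
    }
}
```

Because the walk reaches every descendant, the entries for int2 and String1 are filled in rather than left at the array default of false.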


static int findColumns(SchemaEvolution evolution,
                         String columnName) {
    TypeDescription readerSchema = evolution.getReaderBaseSchema();
    List<String> fieldNames = readerSchema.getFieldNames();
    List<TypeDescription> children = readerSchema.getChildren();
    for (int i = 0; i < fieldNames.size(); ++i) {
      if (columnName.equals(fieldNames.get(i))) {
        TypeDescription result = evolution.getFileType(children.get(i));
        return result == null ? -1 : result.getId();
      }
    }
    return -1;
  }


findColumns() looks only at top-level columns. Because "int2" is a nested 
column, it returns -1 instead of the column index of "int2".
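
The lookup could instead follow a dot-separated path one level at a time. Below is a minimal sketch of that idea, assuming a hypothetical Field class in place of TypeDescription and column ids numbered in pre-order (root 0, int1 1, complex 2, int2 3, String1 4). It illustrates the approach and is not the code from the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a named schema field; not ORC's TypeDescription.
final class Field {
    final String name;
    final int id;
    final List<Field> children = new ArrayList<Field>();

    Field(String name, int id) {
        this.name = name;
        this.id = id;
    }

    Field add(Field child) {
        children.add(child);
        return this;
    }
}

final class NestedColumnFinder {
    /** Resolves a path such as "complex.int2"; returns -1 if any segment is missing. */
    static int findColumn(Field root, String columnPath) {
        Field current = root;
        for (String part : columnPath.split("\\.")) {
            Field next = null;
            for (Field child : current.children) {
                if (child.name.equals(part)) {
                    next = child;
                    break;
                }
            }
            if (next == null) {
                return -1;
            }
            current = next;
        }
        return current.id;
    }

    // struct<int1:int, complex:struct<int2:int,String1:string>>
    static Field exampleSchema() {
        return new Field("", 0)
            .add(new Field("int1", 1))
            .add(new Field("complex", 2)
                .add(new Field("int2", 3))
                .add(new Field("String1", 4)));
    }

    public static void main(String[] args) {
        Field root = exampleSchema();
        System.out.println("complex.int2 -> " + findColumn(root, "complex.int2")); // 3
        System.out.println("int2 -> " + findColumn(root, "int2"));                 // -1
    }
}
```

A bare "int2" still resolves to -1 here, which is why the search argument has to name the fully qualified path.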

*1.2.4 Result*

PPD does not work for the int2 field in the search argument.


*1.3 Expected state*

*1.3.1 Schema*
struct<int1:int, complex:struct<int2:int,String1:string>>
 
*1.3.2 Query*
Replace the column name in the PredicateLeaf with the fully qualified column path.
 
SearchArgument sarg = SearchArgumentFactory.newBuilder()
       .startAnd()
       .startNot()
       .lessThan("complex.int2", PredicateLeaf.Type.LONG, 300000L)
       .end()
       .lessThan("complex.int2", PredicateLeaf.Type.LONG, 600000L)
       .end()
       .build();
 
*1.3.3 Predicate pushdown support for nested fields*

https://github.com/apache/orc/pull/232


*1.3.4 Result*

PPD works for the complex.int2 field in the search argument.

  was:
ORC supports predicate pushdown (block skipping) for ORC Hive tables only on 
top-level fields. 
struct<int1:int,string1:string>

ORC should also support block skipping on nested fields (within structs).
struct<int1:int,complex:struct<int2:int,String1:string>>

Predicate pushdown on nested fields would allow ORC to skip blocks when a 
comparison targets a nested field. Skipping blocks reduces physical memory 
use and compute.


> Predicate push down for nested fields
> -------------------------------------
>
>                 Key: ORC-323
>                 URL: https://issues.apache.org/jira/browse/ORC-323
>             Project: ORC
>          Issue Type: Improvement
>            Reporter: Ashish Sharma
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
