[
https://issues.apache.org/jira/browse/ORC-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Owen O'Malley reassigned ORC-323:
---------------------------------
Assignee: Ashish Sharma
> Predicate push down for nested fields
> -------------------------------------
>
> Key: ORC-323
> URL: https://issues.apache.org/jira/browse/ORC-323
> Project: ORC
> Issue Type: Improvement
> Components: Java
> Reporter: Ashish Sharma
> Assignee: Ashish Sharma
> Priority: Minor
>
> *1. Predicate Pushdown For Nested field*
> *1.1 Objective*
> In ORC (Optimized Row Columnar), every primitive-type column has an index with
> per-column statistics. A predicate refers to a column name in the WHERE clause,
> and pushdown means skipping row groups, stripes, and blocks while reading by
> comparing the predicate against the metadata stored in the stripes. The
> metadata holds the min, max, and sum values of the given column.
> Currently predicate pushdown only works for top-level columns of the schema.
> This improvement extends predicate pushdown to nested structures in Hive.
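>
> For context, the per-column statistics that pushdown compares against can be
> inspected through the reader API. The sketch below is illustrative only: the
> file path is made up, and column id 3 is assumed to be the nested int2 column
> (the pre-order id assignment for the schema shown in 1.2.1 below).
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.orc.ColumnStatistics;
> import org.apache.orc.IntegerColumnStatistics;
> import org.apache.orc.OrcFile;
> import org.apache.orc.Reader;
>
> Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
>     OrcFile.readerOptions(new Configuration()));
> // one ColumnStatistics entry per column id in the file schema
> ColumnStatistics[] stats = reader.getStatistics();
> IntegerColumnStatistics int2Stats = (IntegerColumnStatistics) stats[3];
> System.out.println("min=" + int2Stats.getMinimum()
>     + " max=" + int2Stats.getMaximum()
>     + " sum=" + (int2Stats.isSumDefined() ? int2Stats.getSum() : -1));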
> *1.2 Current state*
>
> *1.2.1 Schema*
> struct<int1:int, complex:struct<int2:int,String1:string>>
>
> *1.2.2 Search Argument*
> SearchArgument sarg = SearchArgumentFactory.newBuilder()
>     .startAnd()
>       .startNot()
>         .lessThan("int2", PredicateLeaf.Type.LONG, 300000L)
>       .end()
>       .lessThan("int2", PredicateLeaf.Type.LONG, 600000L)
>     .end()
>     .build();
>
> *1.2.3 Pushdown Predicate not supported in Nested field in ORC*
>
> private boolean[] populatePpdSafeConversion() {
>   if (fileSchema == null || readerSchema == null || readerFileTypes == null) {
>     return null;
>   }
>   boolean[] result = new boolean[readerSchema.getMaximumId() + 1];
>   boolean safePpd = validatePPDConversion(fileSchema, readerSchema);
>   result[readerSchema.getId()] = safePpd;
>   List<TypeDescription> children = readerSchema.getChildren();
>   if (children != null) {
>     for (TypeDescription child : children) {
>       TypeDescription fileType = getFileType(child.getId());
>       safePpd = validatePPDConversion(fileType, child);
>       result[child.getId()] = safePpd;
>     }
>   }
>   return result;
> }
> populatePpdSafeConversion() only checks conversion validity for the top-level
> fields of the reader schema, so validation of a search argument that refers to
> a nested field fails.
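>
> One possible direction, shown only as a hypothetical sketch (not necessarily
> the approach taken in the pull request linked below), is to recurse into the
> children so that nested fields get validated as well:
>
> private void populatePpdSafeConversionForChildren(boolean[] result,
>                                                   List<TypeDescription> children) {
>   if (children != null) {
>     for (TypeDescription child : children) {
>       TypeDescription fileType = getFileType(child.getId());
>       result[child.getId()] = validatePPDConversion(fileType, child);
>       // recurse so that struct members such as complex.int2 are also validated
>       populatePpdSafeConversionForChildren(result, child.getChildren());
>     }
>   }
> }
>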
> static int findColumns(SchemaEvolution evolution,
>                        String columnName) {
>   TypeDescription readerSchema = evolution.getReaderBaseSchema();
>   List<String> fieldNames = readerSchema.getFieldNames();
>   List<TypeDescription> children = readerSchema.getChildren();
>   for (int i = 0; i < fieldNames.size(); ++i) {
>     if (columnName.equals(fieldNames.get(i))) {
>       TypeDescription result = evolution.getFileType(children.get(i));
>       return result == null ? -1 : result.getId();
>     }
>   }
>   return -1;
> }
> In findColumns() only the top-level columns of the reader schema are examined.
> "int2" is a nested column, so -1 is returned instead of the column id of "int2".
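>
> A hypothetical helper (illustrative only, not the code from the pull request)
> could resolve a fully qualified name such as "complex.int2" by descending
> through struct children one path segment at a time:
>
> static int findColumnByPath(SchemaEvolution evolution, String columnPath) {
>   TypeDescription current = evolution.getReaderBaseSchema();
>   for (String segment : columnPath.split("\\.")) {
>     if (current.getCategory() != TypeDescription.Category.STRUCT) {
>       return -1;   // the path tries to step into a non-struct type
>     }
>     int child = current.getFieldNames().indexOf(segment);
>     if (child < 0) {
>       return -1;   // unknown field name at this level
>     }
>     current = current.getChildren().get(child);
>   }
>   TypeDescription fileType = evolution.getFileType(current);
>   return fileType == null ? -1 : fileType.getId();
> }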
> *1.2.4 Result*
> PPD does not work for the int2 field in the search argument.
> *1.3 Expected state*
> *1.3.1 Schema*
> struct<int1:int, complex:struct<int2:int,String1:string>>
>
> *1.3.2 Query*
> Replace the column name in the PredicateLeaf with the fully qualified column path.
>
> SearchArgument sarg = SearchArgumentFactory.newBuilder()
>     .startAnd()
>       .startNot()
>         .lessThan("complex.int2", PredicateLeaf.Type.LONG, 300000L)
>       .end()
>       .lessThan("complex.int2", PredicateLeaf.Type.LONG, 600000L)
>     .end()
>     .build();
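>
> The search argument would then be handed to the reader together with the
> qualified column name, roughly as in the sketch below. The file path is an
> assumption, and the exact options API may differ between ORC versions.
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.orc.OrcFile;
> import org.apache.orc.Reader;
> import org.apache.orc.RecordReader;
>
> Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
>     OrcFile.readerOptions(new Configuration()));
> Reader.Options options = reader.options()
>     .searchArgument(sarg, new String[]{"complex.int2"});
> // row groups whose statistics cannot satisfy the predicate are skipped
> RecordReader rows = reader.rows(options);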
>
> *1.3.3 Pushdown Predicate support in Nested field*
> https://github.com/apache/orc/pull/232
> *1.3.4 Result*
> PPD works for the complex.int2 field in the search argument.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)