paul-rogers commented on a change in pull request #1954: DRILL-7509: Incorrect TupleSchema is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366605680
 
 

 ##########
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##########
 @@ -63,6 +64,30 @@ public static ColumnMetadata getColumnMetadata(SchemaPath schemaPath, TupleMetad
     return colMetadata;
   }
 
+  /**
+   * Checks if the field identified by the schema path is a child of either {@code DICT} or {@code REPEATED MAP}.
+   * For such fields, nested in {@code DICT} or {@code REPEATED MAP},
+   * filters can't be removed based on Parquet statistics.
+   * @param schemaPath schema path used in filter
+   * @param schema schema containing all the fields in the file
+   * @return {@literal true} if the field is nested inside {@code DICT} (is {@code `key`} or {@code `value`})
+   *         or inside a {@code REPEATED MAP} field, {@literal false} otherwise.
+   */
+  public static boolean isFieldNestedInDictOrRepeatedMap(SchemaPath schemaPath, TupleMetadata schema) {
 
 Review comment:
   Another general observation is that we've accidentally created multiple versions of the same code. The `RequestedTupleImpl` class does something very similar: it converts a `SchemaPath` into a consolidated projection set using logic much like the logic here. For example, `SELECT a, a.b` is consolidated into a single column, `a`, which must be a `MAP` containing at least a `b` member. That code was very tricky to get right, as I'm sure this is. That code will also have to be modified to handle `SELECT a, a.key`: is `a` now a `MAP` or a `DICT`? We don't know; all we know is that `a` is consistent with either interpretation.
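   To make that ambiguity concrete, here is a minimal, self-contained sketch (not Drill's actual `RequestedTupleImpl` API; the class and names such as `ProjectionSketch` and `Inference` are invented for illustration) of how a project list might be consolidated and why `a.key` is ambiguous:

```java
// Illustrative sketch only, not Drill code. Consolidates dotted project
// paths into one entry per top-level column and records what we can infer.
import java.util.LinkedHashMap;
import java.util.Map;

public class ProjectionSketch {

  // What the project list tells us about a top-level column.
  enum Inference { SIMPLE, TUPLE_LIKE }   // TUPLE_LIKE: could be MAP or DICT

  static Map<String, Inference> consolidate(String... projectList) {
    Map<String, Inference> columns = new LinkedHashMap<>();
    for (String path : projectList) {
      String[] parts = path.split("\\.", 2);
      String root = parts[0];
      if (parts.length == 1) {
        // "a" alone adds no structural information.
        columns.putIfAbsent(root, Inference.SIMPLE);
      } else {
        // "a.b" or "a.key": the root must accept member access, but a member
        // name such as "key" does not tell us whether `a` is a MAP or a DICT.
        columns.put(root, Inference.TUPLE_LIKE);
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    // SELECT a, a.b  -> {a=TUPLE_LIKE}: `a` must be map-like with a `b` member.
    System.out.println(consolidate("a", "a.b"));
    // SELECT a, a.key -> {a=TUPLE_LIKE}: consistent with both MAP (member `key`)
    // and DICT (the implicit `key` column); only the actual schema can decide.
    System.out.println(consolidate("a", "a.key"));
  }
}
```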
   
   Do we really need multiple copies?
   
   Some things to consolidate:
   
   * `ColumnMetadata` - which should be our go-to solution as it is the most general.
   * `MaterializedField` - which has all kinds of holes and problems for complex types.
   * `SerializedField` - Like `MaterializedField` but with buffer lengths?
   * `RequestedColumn` - Part of the code mentioned above that parses a project list into a consolidated set of columns.
   * `SchemaPath` - A description of a project list.
   * The logic here.
   * ... - Probably others.
   
   We probably want three tiers of representation:
   
   * Project list - `SchemaPath`
   * "Semanticized" project list - `RequestedTuple`
   * Column metadata - `ColumnMetadata`
   
   Then, we create one set of code that handles transforms and validations (convert the project list to a semanticized list, then validate that against a schema). As it turns out, that is what the EVF does for scan projection planning; maybe we can pull that out and generalize it for use elsewhere?
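   As a rough illustration of those three tiers and a single transform/validate path (again a hypothetical sketch, not the EVF or any existing Drill class; `semanticize` and `validateAgainst` are made-up names), the flow might look like:

```java
// Hypothetical three-tier sketch; names do not correspond to real Drill classes.
import java.util.*;

public class ThreeTierSketch {

  // Tier 1: the raw project list (the SchemaPath role), e.g. "a", "a.key".
  // Tier 2: the "semanticized" project list (the RequestedTuple role):
  // one entry per top-level column with the members it must expose.
  static Map<String, Set<String>> semanticize(List<String> projectList) {
    Map<String, Set<String>> requested = new LinkedHashMap<>();
    for (String path : projectList) {
      String[] parts = path.split("\\.");
      Set<String> members = requested.computeIfAbsent(parts[0], k -> new LinkedHashSet<>());
      if (parts.length > 1) {
        members.add(parts[1]);
      }
    }
    return requested;
  }

  // Tier 3: column metadata (the ColumnMetadata role), reduced here to a
  // column name plus a structural kind.
  enum Kind { PRIMITIVE, MAP, DICT }

  // One shared validation: does the requested shape fit the declared schema?
  static void validateAgainst(Map<String, Set<String>> requested, Map<String, Kind> schema) {
    requested.forEach((col, members) -> {
      Kind kind = schema.get(col);
      if (kind == null) {
        return; // unknown columns are allowed here; they resolve elsewhere
      }
      if (!members.isEmpty() && kind == Kind.PRIMITIVE) {
        throw new IllegalArgumentException(
            "Column `" + col + "` is primitive but was projected with members " + members);
      }
      // For MAP vs DICT, `a.key` is legal either way; only the schema
      // (tier 3) can say which interpretation applies.
    });
  }

  public static void main(String[] args) {
    Map<String, Set<String>> requested = semanticize(List.of("a", "a.key", "b.c"));
    Map<String, Kind> schema = Map.of("a", Kind.DICT, "b", Kind.MAP);
    validateAgainst(requested, schema);   // passes: shapes are consistent
    System.out.println(requested);        // {a=[key], b=[c]}
  }
}
```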
