[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

ASF GitHub Bot (Jira) Sun, 19 Jan 2020 23:33:29 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019275#comment-17019275
 ]


ASF GitHub Bot commented on DRILL-7509:
---------------------------------------

KazydubB commented on pull request #1954: DRILL-7509: Incorrect TupleSchema is 
created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366847315
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MetadataUtils.java
 ##########
 @@ -187,6 +187,10 @@ public static ColumnMetadata newMapArray(String name, TupleMetadata schema) {
     return new MapColumnMetadata(name, DataMode.REPEATED, (TupleSchema) schema);
   }
 
+  public static DictColumnMetadata newDictArray(String name, TupleMetadata schema) {
+    return new DictColumnMetadata(name, DataMode.REPEATED, (TupleSchema) schema);
 
 Review comment:
   @paul-rogers, let me explain: a Dict stores its `key` and `value` 
`ColumnMetadata`s in a `TupleSchema`, the same way a Map's members are stored, 
but with validation of each field (name, type). A Dict does not contain a Map; 
it holds `key` and `value` directly in its `TupleSchema schema` field.
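   
   To illustrate the storage idea only, here is a minimal sketch with a 
hypothetical `DictSchemaSketch` class (not Drill's `DictColumnMetadata` or 
`TupleSchema`). It assumes that name validation means accepting only `key` and 
`value`; type validation is left out because its rules are not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a DICT's child schema; not Drill's actual classes.
final class DictSchemaSketch {
  static final String KEY = "key";
  static final String VALUE = "value";

  // Children are held directly, the way a map's members would be
  // (column name -> type name, both plain strings in this sketch).
  private final Map<String, String> children = new LinkedHashMap<>();

  /** Adds a child column, rejecting any name other than key/value (assumed rule). */
  void addColumn(String name, String typeName) {
    if (!KEY.equals(name) && !VALUE.equals(name)) {
      throw new IllegalArgumentException(
          "DICT accepts only `key` and `value`, got: " + name);
    }
    if (children.containsKey(name)) {
      throw new IllegalArgumentException("Duplicate DICT field: " + name);
    }
    children.put(name, typeName);
  }

  Map<String, String> children() {
    return children;
  }
}
```

   In this sketch, adding an intermediate group such as `key_value` is 
rejected, which mirrors the point that a Dict holds only `key` and `value`.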
   
   When `TupleMetadata` is constructed for a Parquet table as part of the table 
metadata, we loop over the `SchemaPath` representation of each Parquet leaf 
field (e.g. `` `mapcol`.`map`.`key` ``, `` `structcol`.`b` ``, where the last 
segment of the path is a primitive). Such named segments are treated as either 
a (Drill's) `MAP` or a `DICT`, depending on the parent segment's type.
   
   (Parquet's) `MAP` is represented as a group (note that the name of the nested 
group, `key_value` below, can differ depending on the system that produced 
the Parquet file):
   ```
   <map-repetition> group <name> (MAP) {
     repeated group key_value {
       required <key-type> key;
       <value-repetition> <value-type> value;
     }
   }
   ```
   Before the changes in this PR, when `TupleMetadata` was created for the 
table and a `DICT` column was encountered, the nested `key_value` group was 
included as a Drill `MAP`, which in turn contained the `key` and `value` 
fields. Thus, this segment needs to be skipped when its parent's type is known 
to be `DICT`, so that the `DICT` field gets the correct `ColumnMetadata`.
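   
   The skipping step can be sketched roughly as follows. This is a simplified, 
self-contained model and not the code in this PR: `DictPathSketch`, 
`dropRepeatedGroup` and the `dictPaths` set are hypothetical names, and paths 
are plain dot-joined strings rather than `SchemaPath` objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the segment-skipping idea; not Drill's code.
final class DictPathSketch {

  /**
   * Returns the path with every segment that directly follows a DICT column
   * removed, i.e. without the Parquet repeated group (`key_value`, `map`, ...).
   *
   * @param segments  segments of one leaf path, e.g. [mapcol, map, key]
   * @param dictPaths dot-joined paths of DICT columns, written WITHOUT the
   *                  repeated group (an assumption of this sketch)
   */
  static List<String> dropRepeatedGroup(List<String> segments, Set<String> dictPaths) {
    List<String> result = new ArrayList<>();
    StringBuilder currentPath = new StringBuilder();
    boolean parentIsDict = false;
    for (String segment : segments) {
      if (parentIsDict) {
        // The parent is a DICT, so this segment is Parquet's nested group.
        // Skip it; the next segment (key or value) attaches to the DICT itself.
        parentIsDict = false;
        continue;
      }
      result.add(segment);
      if (currentPath.length() > 0) {
        currentPath.append('.');
      }
      currentPath.append(segment);
      parentIsDict = dictPaths.contains(currentPath.toString());
    }
    return result;
  }
}
```

   With `mapcol` registered as a DICT, the path `mapcol.map.key` reduces to 
`mapcol.key`, while `structcol.b` is left unchanged because `structcol` is not 
a DICT.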
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-7509
>                 URL: https://issues.apache.org/jira/browse/DRILL-7509
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>             Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and 
> `value` fields, rather than holding `key` and `value` directly in the 
> {{DICT}}'s {{TupleSchema}}. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} representation (Parquet's {{MAP}} corresponds 
> to Drill's {{DICT}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
