[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015792#comment-17015792
 ] 

ASF GitHub Bot commented on DRILL-7509:
---------------------------------------

ihuzenko commented on pull request #1954: DRILL-7509: Incorrect TupleSchema is 
created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366483635
 
 

 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetTableMetadataUtils.java
 ##########
 @@ -507,29 +507,67 @@ private static long convertToDrillDateValue(int 
dateValue) {
       MetadataBase.ParquetTableMetadataBase parquetTableMetadata) {
     int precision = 0;
     int scale = 0;
-    int definitionLevel = 1;
-    int repetitionLevel = 0;
     MetadataVersion metadataVersion = new 
MetadataVersion(parquetTableMetadata.getMetadataVersion());
     // only ColumnTypeMetadata_v3 and ColumnTypeMetadata_v4 store information 
about scale, precision, repetition level and definition level
     if (parquetTableMetadata.hasColumnMetadata() && 
(metadataVersion.compareTo(new MetadataVersion(3, 0)) >= 0)) {
       scale = parquetTableMetadata.getScale(name);
       precision = parquetTableMetadata.getPrecision(name);
-      repetitionLevel = parquetTableMetadata.getRepetitionLevel(name);
-      definitionLevel = parquetTableMetadata.getDefinitionLevel(name);
-    }
-    TypeProtos.DataMode mode;
-    if (repetitionLevel >= 1) {
-      mode = TypeProtos.DataMode.REPEATED;
-    } else if (repetitionLevel == 0 && definitionLevel == 0) {
-      mode = TypeProtos.DataMode.REQUIRED;
-    } else {
-      mode = TypeProtos.DataMode.OPTIONAL;
     }
+
+    TypeProtos.DataMode mode = getDataMode(parquetTableMetadata, 
metadataVersion, name);
     return 
TypeProtos.MajorType.newBuilder(ParquetReaderUtility.getType(primitiveType, 
originalType, precision, scale))
         .setMode(mode)
         .build();
   }
 
+  /**
+   * Obtain data mode from table metadata for a column. Algorithm for 
retrieving data mode depends on metadata version:
+   * <ul>
+   *   <li>starting from version {@code 4.2}, Parquet's {@link 
org.apache.parquet.schema.Type.Repetition}
+   *   is stored in table metadata itself;</li>
+   *   <li>starting from {@code 3.0} to {@code 4.2} (exclusively) the data 
mode is
+   *   computed based on max {@code definition} and {@code repetition} levels
+   *   ({@link 
MetadataBase.ParquetTableMetadataBase#getDefinitionLevel(String[])} and
+   *   {@link 
MetadataBase.ParquetTableMetadataBase#getRepetitionLevel(String[])} 
respectively)
+   *   obtained from Parquet's schema;
+   *
+   *   <p><strong>Note:</strong> this computation may lead to erroneous 
results,
+   *   when there are few nesting levels.</p>
+   *   </li>
+   *   <li>prior to {@code 3.0} {@code DataMode.OPTIONAL} is returned.</li>
+   * </ul>
+   * @param tableMetadata Parquet table metadata
+   * @param metadataVersion version of Parquet table metadata
+   * @param name (leaf) column to obtain data mode for
+   * @return data mode of the specified column
+   */
+  private static TypeProtos.DataMode 
getDataMode(MetadataBase.ParquetTableMetadataBase tableMetadata,
 
 Review comment:
   Could you please simplify the method, for example: 
   ```java
       TypeProtos.DataMode mode = TypeProtos.DataMode.OPTIONAL;
       if (tableMetadata.hasColumnMetadata()) {
         if (metadataVersion.compareTo(new MetadataVersion(4, 2)) >= 0) {
           mode = 
ParquetReaderUtility.getDataMode(tableMetadata.getRepetition(name));
         } else if (metadataVersion.compareTo(new MetadataVersion(3, 0)) >= 0) {
           int repetitionLevel = tableMetadata.getRepetitionLevel(name);
           if (repetitionLevel >= 1) {
             mode = TypeProtos.DataMode.REPEATED;
           } else if (repetitionLevel == 0 && 
tableMetadata.getDefinitionLevel(name) == 0) {
             mode = TypeProtos.DataMode.REQUIRED;
           }
         }
       }
       return mode;
   ```
   Also, since we know that ```MetadataVersion``` is based on major & minor int 
numbers, I'd suggest providing a few useful methods to make comparisons easy to 
read. For example, we could use methods ```atLeast(major, minor)``` and 
```atMost(major, minor)``` similar to Range class. We need to analyze compare 
to usages and figure out which method would be useful to add. 
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-7509
>                 URL: https://issues.apache.org/jira/browse/DRILL-7509
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>             Fix For: 1.18.0
>
>
> When {{DICT}} column is queried from Parquet file, its {{TupleSchema}} 
> contains nested element, e.g. `map`, itself contains `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to