[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

ASF GitHub Bot (Jira) Sat, 12 Feb 2022 07:44:06 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491375#comment-17491375
 ]


ASF GitHub Bot commented on PARQUET-2120:
-----------------------------------------

rshkv commented on a change in pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#discussion_r805173968



##########
File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java
##########
@@ -75,40 +75,12 @@ public int run() throws IOException {
       while ((dictionaryReader = reader.getNextDictionaryReader()) != null) {
         DictionaryPage page = dictionaryReader.readDictionaryPage(descriptor);
 
-        Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
-
-        console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column, page.getCompressedSize());
-        for (int i = 0; i <= dict.getMaxId(); i += 1) {
-          switch(type.getPrimitiveTypeName()) {
-            case BINARY:
-              if (type.getLogicalTypeAnnotation() instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).toStringUsingUTF8(), 70));
-              } else {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).getBytesUnsafe(), 70));
-              }
-              break;
-            case INT32:
-              console.info("{}: {}", String.format("%6d", i),
-                dict.decodeToInt(i));
-              break;
-            case INT64:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToLong(i));
-              break;
-            case FLOAT:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToFloat(i));
-              break;
-            case DOUBLE:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToDouble(i));
-              break;
-            default:
-              throw new IllegalArgumentException(
-                  "Unknown dictionary type: " + type.getPrimitiveTypeName());
-          }
+        if (page != null) {
+          console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column);
+          Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
+          printDictionary(dict, type);
+        } else {
+          console.info("\nRow group {} has no dictionary for \"{}\"", rowGroup, column);

Review comment:
   For a file that mixes pages with and without dictionary encoding, the output would look like this, for example:
   ```
   Row group 0 has no dictionary for "col"
   
   Row group 1 dictionary for "col":
        0: "b"
        1: "c"
   ```
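
A minimal, self-contained sketch of the null guard that produces output of this shape. It does not use the Parquet API: the `describe` method and the `List<String>` standing in for a dictionary page are hypothetical, and a `null` list entry models `readDictionaryPage` returning `null` for a row group without dictionary encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DictionaryNullGuardSketch {

  // Hypothetical stand-in for the loop in ShowDictionaryCommand.run():
  // each list entry models the dictionary of one row group, and a null
  // entry models readDictionaryPage returning null (no dictionary page).
  static List<String> describe(String column, List<List<String>> dictionaryPerRowGroup) {
    List<String> lines = new ArrayList<>();
    for (int rowGroup = 0; rowGroup < dictionaryPerRowGroup.size(); rowGroup++) {
      List<String> dict = dictionaryPerRowGroup.get(rowGroup);
      if (dict != null) {
        lines.add(String.format("Row group %d dictionary for \"%s\":", rowGroup, column));
        for (int i = 0; i < dict.size(); i++) {
          // Same "%6d" index formatting as the command uses.
          lines.add(String.format("%6d: \"%s\"", i, dict.get(i)));
        }
      } else {
        // Report the missing dictionary and keep going instead of failing.
        lines.add(String.format("Row group %d has no dictionary for \"%s\"", rowGroup, column));
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    // Row group 0 has no dictionary page; row group 1 has entries "b" and "c".
    List<List<String>> pages = Arrays.asList(null, Arrays.asList("b", "c"));
    describe("col", pages).forEach(System.out::println);
  }
}
```

The point of the guard is that a row group without a dictionary is reported and skipped, so later row groups that do have dictionaries are still printed.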




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



> parquet-cli dictionary command fails on pages without dictionary encoding
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2120
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2120
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli
>    Affects Versions: 1.12.2
>            Reporter: Willi Raschkowski
>            Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet                
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>       at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>       at org.apache.parquet.cli.Main.run(Main.java:155)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>       at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet      
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> --------------------------------------------------------------------------------
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S   _     1         46.00 B    0       "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> --------------------------------------------------------------------------------
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S _ R     200       0.34 B     0       "b" / "c"
> {code}
> (Note the missing {{R}}, i.e. dictionary encoding, on the page in row group 0.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
> assumes {{readDictionaryPage}} always returns a page and does not handle the 
> case where it returns {{null}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
