acdha opened a new issue, #3095:
URL: https://github.com/apache/parquet-java/issues/3095

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Using Parquet CLI 1.15.0 installed via Homebrew on macOS, I noticed some surprising behaviour in `parquet-cli` with nested columns.
   
   `parquet schema catalog.parquet` returns a schema showing the nested types 
(I've trimmed the field list slightly):
   
   ```avro
   {
     "type" : "record",
     "name" : "schema",
     "fields" : [ {
       "name" : "item_id",
       "type" : "string"
     }, {
       "name" : "title",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "language",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "subjects",
       "type" : [ "null", {
         "type" : "array",
         "items" : {
           "type" : "record",
           "name" : "list",
           "fields" : [ {
             "name" : "element",
             "type" : "string"
           } ]
         }
       } ],
       "default" : null
     }, {
       "name" : "authors",
       "type" : [ "null", {
         "type" : "array",
         "items" : {
           "type" : "record",
           "name" : "list",
           "namespace" : "list2",
           "fields" : [ {
             "name" : "element",
             "type" : "string"
           } ]
         }
       } ],
       "default" : null
     } ]
   }
   ```
   
   `parquet dictionary -c subjects.list.element catalog.parquet` also returns the expected values for that nested field:
   
   ```
   Row group 0 dictionary for "subjects.list.element":
        0: "Bestsellers"
        1: "Biography"
        2: "Fantasy Fiction"
        3: "Music Theory"
        4: "Disability"
        5: "Family"
        6: "Young Adult"
   ```
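
The column path passed to `-c` can be read off the schema output. The following is a minimal sketch, not part of `parquet-cli`, that walks an Avro schema like the one above and prints dotted leaf paths. The helper name `leaf_paths` is hypothetical, and the rule that an array's item record name (`list`) becomes a path segment is an assumption based only on the `subjects.list.element` example in this report.

```python
import json


def leaf_paths(avro_type, prefix=()):
    """Yield a dotted path for every leaf reachable from an Avro schema node."""
    if isinstance(avro_type, list):
        # Union such as ["null", {...}]: follow every non-null branch.
        for branch in avro_type:
            if branch != "null":
                yield from leaf_paths(branch, prefix)
    elif isinstance(avro_type, dict):
        kind = avro_type.get("type")
        if kind == "record":
            for field in avro_type["fields"]:
                yield from leaf_paths(field["type"], prefix + (field["name"],))
        elif kind == "array":
            items = avro_type["items"]
            if isinstance(items, dict) and items.get("type") == "record":
                # Assumption: the item record name ("list") is a path segment.
                yield from leaf_paths(items, prefix + (items["name"],))
            else:
                yield from leaf_paths(items, prefix)
        else:
            # Any other {"type": ...} wrapper is treated as a leaf.
            yield from leaf_paths(kind, prefix)
    else:
        # Primitive type name such as "string".
        yield ".".join(prefix)


# Trimmed copy of the schema from this report.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "schema",
  "fields": [
    {"name": "item_id", "type": "string"},
    {"name": "title", "type": ["null", "string"], "default": null},
    {"name": "subjects", "type": ["null", {
      "type": "array",
      "items": {"type": "record", "name": "list",
                "fields": [{"name": "element", "type": "string"}]}
    }], "default": null}
  ]
}
""")

for path in leaf_paths(SCHEMA):
    print(path)
# item_id
# title
# subjects.list.element
```

Maps, enums, and other Avro types are not handled specially here; the sketch only covers the shapes that appear in this schema.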
   
   However, when using `cat` or `head` to display the file contents, the nested fields are displayed as null:
   
   ```
   {"bmc_id": "id1", "title": null, "language": "en", "subjects": null, "authors": null}
   {"bmc_id": "id2", "title": null, "language": "en", "subjects": null, "authors": null}
   {"bmc_id": "id3", "title": null, "language": "en", "subjects": null, "authors": null}
   ```
   
   Other tools such as PyArrow and Pandas do display those values as arrays. I filed this as a bug because the output _looks_ like it's working: since those fields are nullable, there's no way to tell from the output whether a null value is correct.
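
As a rough triage aid, the sketch below scans output shaped like the sample above (one JSON object per line, which is an assumption about the `cat`/`head` output format) and lists the fields that are null in every record. Such fields are candidates for cross-checking with another reader; the check cannot tell a genuine null from a dropped value. The function name `always_null_fields` and the sample records are hypothetical.

```python
import json


def always_null_fields(lines):
    """Return the sorted names of fields that are null in every record."""
    seen, non_null = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for name, value in record.items():
            seen.add(name)
            if value is not None:
                non_null.add(name)
    return sorted(seen - non_null)


# Hypothetical records, shaped like the output in this report.
SAMPLE = [
    '{"id": "id1", "title": null, "language": "en", "subjects": null}',
    '{"id": "id2", "title": "A title", "language": "en", "subjects": null}',
]

print(always_null_fields(SAMPLE))  # ['subjects']
```

To run it against real output, the same function could be given `sys.stdin` instead of `SAMPLE`.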
   
   ### Component(s)
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

