jupiter opened a new issue, #5606:
URL: https://github.com/apache/arrow-rs/issues/5606

   **Describe the bug**
   
   The [Parquet Format](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) specifies that:
   > The `key` field encodes the map's key type. This field must have 
repetition `required` and must always be present.
   
   I.e. 
   
   ```
   <map-repetition> group <name> (MAP) {
     repeated group key_value {
       required <key-type> key;
       <value-repetition> <value-type> value;
     }
   }
   ```
   
   However, most implementations of the format do not appear to enforce this in 
the Thrift schema, and producers such as Hive/Spark/Presto/Trino/AWS Athena do 
not produce Parquet files like this. A huge number of such files already exist 
on data lakes, and rewriting them to comply with the spec does not seem 
feasible.
   
   **To Reproduce**
   
   ```rs
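   // Imports assumed for this fragment (crates: object_store, parquet, url)
   use std::sync::Arc;
   use object_store::{http::HttpBuilder, path::Path, ObjectStore};
   use parquet::arrow::arrow_reader::ArrowReaderMetadata;
   use parquet::arrow::async_reader::{AsyncFileReader, ParquetObjectReader};
   use parquet::schema::printer::print_schema;
   use url::Url;

           // The lines below run inside an async fn (e.g. an async main)
           let url_str = "https://overturemaps-us-west-2.s3.us-west-2.amazonaws.com/release/2023-07-26-alpha.0/theme%3Dtransportation/type%3Dconnector/20230726_134827_00007_dg6b6_01b086fc-f35b-487c-8d4e-5cdbbdc1785d";
           let url = Url::parse(url_str).unwrap();
           let storage_container = Arc::new(HttpBuilder::new().with_url(url).build().unwrap());
           let location = Path::from("");
           let meta = storage_container.head(&location).await.unwrap();
           let mut reader = ParquetObjectReader::new(storage_container, meta);
   
           // Parquet schema can be printed
           let p_metadata = reader.get_metadata().await.unwrap();
           print_schema(&mut std::io::stdout(), p_metadata.file_metadata().schema());
   
           // Metadata cannot be loaded
           let metadata = ArrowReaderMetadata::load_async(&mut reader, Default::default())
               .await
               .unwrap();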
   ```
   
   Results in:
   
   ```
   thread 'main' panicked at src/main.rs:79:10:
   called `Result::unwrap()` on an `Err` value: ArrowError("Map keys must be required")
   ```
   
   **Expected behavior**
   
   Map keys are assumed to be required, regardless of explicit specification in 
the Thrift schema, and data is read accordingly.
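
As a rough illustration of the difference between the current and the expected behavior, here is a minimal, std-only sketch. The `Repetition` enum, the `Field` struct, and both `check_map_key_*` functions are hypothetical and are not part of arrow-rs or the parquet crate; they only model the schema-level decision. A real fix would also have to deal with how the key column's definition levels are decoded.

```rust
// Hypothetical, simplified model of a Parquet MAP key field.
// None of these types come from the arrow-rs or parquet crates.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Field {
    name: String,
    repetition: Repetition,
}

/// Strict check: reject any map key that is not declared `required`.
/// This mirrors the behavior reported above.
fn check_map_key_strict(key: &Field) -> Result<Field, String> {
    if key.repetition == Repetition::Required {
        Ok(key.clone())
    } else {
        Err("Map keys must be required".to_string())
    }
}

/// Lenient check: treat an `optional` key as if it were `required`,
/// as the expected behavior asks. A `repeated` key is still rejected.
fn check_map_key_lenient(key: &Field) -> Result<Field, String> {
    match key.repetition {
        Repetition::Required | Repetition::Optional => Ok(Field {
            name: key.name.clone(),
            repetition: Repetition::Required,
        }),
        Repetition::Repeated => Err("Map keys must not be repeated".to_string()),
    }
}

fn main() {
    // A key declared `optional`, as in the files described above.
    let optional_key = Field {
        name: "key".to_string(),
        repetition: Repetition::Optional,
    };
    let repeated_key = Field {
        name: "key".to_string(),
        repetition: Repetition::Repeated,
    };

    println!("strict:   {:?}", check_map_key_strict(&optional_key));
    println!("lenient:  {:?}", check_map_key_lenient(&optional_key));
    println!("repeated: {:?}", check_map_key_lenient(&repeated_key));
}
```

The lenient variant changes only the declared repetition; whether null keys that might actually occur in such files should be an error is a separate question that this sketch does not address.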
   
   **Additional context**
   
   This has come up in a PyArrow issue: 
https://github.com/apache/arrow/issues/37389
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.