jupiter opened a new issue, #5606: URL: https://github.com/apache/arrow-rs/issues/5606
**Describe the bug**

The [Parquet Format](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) specifies that:

> The `key` field encodes the map's key type. This field must have repetition `required` and must always be present.

I.e.

```
<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}
```

However, most implementations of the format do not appear to enforce this in the Thrift schema, and producers such as Hive/Spark/Presto/Trino/AWS Athena do not write Parquet files with `required` map keys. Many such files already exist in data lakes, and rewriting them to comply with the spec does not seem feasible.

**To Reproduce**

```rs
// Imports the snippet relies on (parquet with the `async` and `object_store`
// features, object_store with the `http` feature).
use std::sync::Arc;

use object_store::{http::HttpBuilder, path::Path, ObjectStore};
use parquet::arrow::arrow_reader::ArrowReaderMetadata;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetObjectReader};
use parquet::schema::printer::print_schema;
use url::Url;

// The following runs inside an async context (e.g. an async `main`).
let url_str = "https://overturemaps-us-west-2.s3.us-west-2.amazonaws.com/release/2023-07-26-alpha.0/theme%3Dtransportation/type%3Dconnector/20230726_134827_00007_dg6b6_01b086fc-f35b-487c-8d4e-5cdbbdc1785d";
let url = Url::parse(url_str).unwrap();
let storage_container = Arc::new(HttpBuilder::new().with_url(url).build().unwrap());
let location = Path::from("");
let meta = storage_container.head(&location).await.unwrap();
let mut reader = ParquetObjectReader::new(storage_container, meta);

// Parquet schema can be printed
let mut p_metadata = reader.get_metadata().await.unwrap();
print_schema(&mut std::io::stdout(), p_metadata.file_metadata().schema());

// Metadata cannot be loaded
let metadata = ArrowReaderMetadata::load_async(&mut reader, Default::default())
    .await
    .unwrap();
```

Results in:

```
thread 'main' panicked at src/main.rs:79:10:
called `Result::unwrap()` on an `Err` value: ArrowError("Map keys must be required")
```

**Expected behavior**

Map keys are assumed to be required, regardless of the repetition explicitly declared in the Thrift schema, and data is read accordingly.

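To make the expected behavior concrete, here is a minimal sketch of the lenient rule being requested. It does not use the `parquet` crate: `Repetition`, `Field`, and `effective_key_repetition` are hypothetical stand-ins invented for illustration, and rejecting a `repeated` key is my assumption rather than something the spec text quoted above spells out.

```rust
/// Hypothetical stand-in for the repetition of a Parquet schema field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical stand-in for a field in a Parquet schema.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Field {
    name: String,
    repetition: Repetition,
}

/// Returns the repetition a lenient reader would assume for a map key.
///
/// A key declared `optional` is treated as `required` instead of being
/// rejected. A `repeated` key is still an error (assumption: a repeated
/// key cannot be reinterpreted as a single required value).
fn effective_key_repetition(key: &Field) -> Result<Repetition, String> {
    match key.repetition {
        Repetition::Required | Repetition::Optional => Ok(Repetition::Required),
        Repetition::Repeated => Err(format!(
            "Map key field '{}' must not be repeated",
            key.name
        )),
    }
}
```

With this rule, a file whose schema declares `optional binary key` would load as if the key were `required`, which matches how the producers listed above appear to intend it.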
**Additional context**

This has come up in a PyArrow issue: https://github.com/apache/arrow/issues/37389
