This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git


The following commit(s) were added to refs/heads/master by this push:
     new 1ba3447  File containing a Map schema without explicitly required key (#47)
1ba3447 is described below

commit 1ba34478f535c89382263c42c675a9af4f57f2dd
Author: Pieter Raubenheimer <[email protected]>
AuthorDate: Tue Apr 16 15:30:45 2024 +0100

    File containing a Map schema without explicitly required key (#47)
---
 data/README.md                    |  39 ++++++++++++++++++++++++++++++++++++++
 data/incorrect_map_schema.parquet | Bin 0 -> 595 bytes
 2 files changed, 39 insertions(+)

diff --git a/data/README.md b/data/README.md
index f805c8b..2782a93 100644
--- a/data/README.md
+++ b/data/README.md
@@ -50,6 +50,7 @@
 | float16_zeros_and_nans.parquet    | Float16 (logical type) column with NaNs and zeros as min/max values. . See [note](#float16-files) below |
 | concatenated_gzip_members.parquet     | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
 | byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
+| incorrect_map_schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#incorrect-map-schema) |
 
 TODO: Document what each file is in the table above.
 
@@ -387,3 +388,41 @@ To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
 `BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
 the values from the corresponding `PLAIN`-encoded column. The values should
 be equal.
+
+## Incorrect Map Schema
+
+A number of producers, such as Presto/Trino/Athena, have been creating files with schemas
+where the Map key fields are marked as optional rather than required.
+This is not spec-compliant, yet appears in a number of existing data files in the wild.
+
+This issue has been fixed in:
+- [Trino v386+](https://github.com/trinodb/trino/commit/3247bd2e64d7422bd13e805cd67cfca3fa8ba520)
+- [Presto v0.274+](https://github.com/prestodb/presto/commit/842b46972c11534a7729d0a18e3abc5347922d1a)
+
+We can recreate these problematic files for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630)
+with relevant Presto/Trino CLI, or with AWS Athena Console:
+
+```sql
+CREATE TABLE my_catalog.my_table_name WITH (format = 'Parquet') AS (
+    SELECT MAP (
+        ARRAY['name', 'parent'],
+        ARRAY[
+            'report',
+            'another'
+        ]
+    ) my_map
+)
+```
+
+The schema in the created file is:
+
+```
+message hive_schema {
+  OPTIONAL group my_map (MAP) {
+    REPEATED group key_value (MAP_KEY_VALUE) {
+      OPTIONAL BYTE_ARRAY key (STRING);
+      OPTIONAL BYTE_ARRAY value (STRING);
+    }
+  }
+}
+```
\ No newline at end of file
diff --git a/data/incorrect_map_schema.parquet b/data/incorrect_map_schema.parquet
new file mode 100644
index 0000000..62102f0
Binary files /dev/null and b/data/incorrect_map_schema.parquet differ
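
For readers who want to spot the problem described in the README hunk, here is a minimal sketch that scans a textual schema dump in the format shown in the README and reports every Map whose key field is not REQUIRED. It is not part of the commit. The function name `non_required_map_keys` and the regular expressions are hypothetical, and the sketch assumes the key field is literally named `key` and that the enclosing group carries the `MAP` annotation, as in the dump in the README.

```python
import re

# Schema dump copied from the README section added by this commit.
SCHEMA_TEXT = """\
message hive_schema {
  OPTIONAL group my_map (MAP) {
    REPEATED group key_value (MAP_KEY_VALUE) {
      OPTIONAL BYTE_ARRAY key (STRING);
      OPTIONAL BYTE_ARRAY value (STRING);
    }
  }
}
"""

# "OPTIONAL group my_map (MAP) {" -> repetition, group name, optional annotation
GROUP_RE = re.compile(r"^(\w+)\s+group\s+(\w+)(?:\s+\((\w+)\))?\s*\{$")
# "OPTIONAL BYTE_ARRAY key (STRING);" -> repetition, physical type, field name
FIELD_RE = re.compile(r"^(\w+)\s+(\w+)\s+(\w+)")


def non_required_map_keys(schema_text):
    """Return (map_name, repetition) for each MAP whose `key` is not REQUIRED.

    Hypothetical helper: it only understands the simple text layout above.
    """
    problems = []
    stack = []  # (group name, annotation) for each enclosing group
    for raw in schema_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            if stack:
                stack.pop()
            continue
        if line.endswith("{"):
            match = GROUP_RE.match(line)
            if match:
                stack.append((match.group(2), (match.group(3) or "").upper()))
            else:
                # The root line, e.g. "message hive_schema {"
                parts = line.split()
                stack.append((parts[1] if len(parts) > 1 else "", ""))
            continue
        match = FIELD_RE.match(line)
        # A map key sits two levels below the group annotated MAP:
        # MAP group -> repeated key_value group -> key field.
        if (match and match.group(3) == "key"
                and len(stack) >= 2 and stack[-2][1] == "MAP"):
            repetition = match.group(1).upper()
            if repetition != "REQUIRED":
                problems.append((stack[-2][0], repetition))
    return problems


if __name__ == "__main__":
    print(non_required_map_keys(SCHEMA_TEXT))
```

A reader implementation could use a check like this to decide whether to treat such keys leniently instead of rejecting the file. How the dump is obtained from a real file is left open here.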
