This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new d79a010 Add large_string_map data file (#38)
d79a010 is described below
commit d79a0101d90dfa3bbb10337626f57a3e8c4b5363
Author: Arthur Passos <[email protected]>
AuthorDate: Wed Jun 21 14:01:14 2023 -0300
Add large_string_map data file (#38)
* add chunked_string_map data file
* use BROTLI compression for greater space saving
* add description
* correct arrow type name
* rename file as suggested by reviewers
* update readme as suggested
* rename in docs as well
* Make wording more precise, remove Arrow vocabulary
* Add description of how the file was generated
* Add link to paragraph
---------
Co-authored-by: Antoine Pitrou <[email protected]>
---
data/README.md | 19 +++++++++++++++++++
data/large_string_map.brotli.parquet | Bin 0 -> 4325 bytes
2 files changed, 19 insertions(+)
diff --git a/data/README.md b/data/README.md
index 638f0d1..27c381a 100644
--- a/data/README.md
+++ b/data/README.md
@@ -44,6 +44,7 @@
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
TODO: Document what each file is in the table above.
@@ -202,3 +203,21 @@ metadata.row_group(0).column(0)
# total_compressed_size: 84
# total_uncompressed_size: 80
```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({ "arr": arr })
+
+pq.write_table(tab, "test.parquet", compression='BROTLI')
+```
+
+It is meant to exercise reading of structured data where each value
+is smaller than 2GB but the combined uncompressed column chunk size
+is greater than 2GB.
diff --git a/data/large_string_map.brotli.parquet b/data/large_string_map.brotli.parquet
new file mode 100644
index 0000000..fc5c8b2
Binary files /dev/null and b/data/large_string_map.brotli.parquet differ