This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
     new d79a010  Add large_string_map data file (#38)
d79a010 is described below

commit d79a0101d90dfa3bbb10337626f57a3e8c4b5363
Author:     Arthur Passos <arthur...@outlook.com>
AuthorDate: Wed Jun 21 14:01:14 2023 -0300

    Add large_string_map data file (#38)

    * add chunked_string_map data file
    * use BROTLI compression for greater space saving
    * add description
    * correct arrow type name
    * rename file as suggested by reviewers
    * update readme as suggested
    * rename in docs as well
    * Make wording more precise, remove Arrow vocabulary
    * Add description of how the file was generated
    * Add link to paragraph

    ---------

    Co-authored-by: Antoine Pitrou <pit...@free.fr>
---
 data/README.md                       | 19 +++++++++++++++++++
 data/large_string_map.brotli.parquet | Bin 0 -> 4325 bytes
 2 files changed, 19 insertions(+)

diff --git a/data/README.md b/data/README.md
index 638f0d1..27c381a 100644
--- a/data/README.md
+++ b/data/README.md
@@ -44,6 +44,7 @@
 | rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
 | plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
 | rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |

 TODO: Document what each file is in the table above.
@@ -202,3 +203,21 @@ metadata.row_group(0).column(0)
 # total_compressed_size: 84
 # total_uncompressed_size: 80
 ```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({ "arr": arr })
+
+pq.write_table(tab, "test.parquet", compression='BROTLI')
+```
+
+It is meant to exercise reading of structured data where each value
+is smaller than 2GB but the combined uncompressed column chunk size
+is greater than 2GB.
diff --git a/data/large_string_map.brotli.parquet b/data/large_string_map.brotli.parquet
new file mode 100644
index 0000000..fc5c8b2
Binary files /dev/null and b/data/large_string_map.brotli.parquet differ