This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
     new d79a010  Add large_string_map data file (#38)
d79a010 is described below

commit d79a0101d90dfa3bbb10337626f57a3e8c4b5363
Author:     Arthur Passos <arthur...@outlook.com>
AuthorDate: Wed Jun 21 14:01:14 2023 -0300

    Add large_string_map data file (#38)

    * add chunked_string_map data file
    * use BROTLI compression for greater space saving
    * add description
    * correct arrow type name
    * rename file as suggested by reviewers
    * update readme as suggested
    * rename in docs as well
    * Make wording more precise, remove Arrow vocabulary
    * Add description of how the file was generated
    * Add link to paragraph

    ---------

    Co-authored-by: Antoine Pitrou <pit...@free.fr>
---
 data/README.md                       | 19 +++++++++++++++++++
 data/large_string_map.brotli.parquet | Bin 0 -> 4325 bytes
 2 files changed, 19 insertions(+)

diff --git a/data/README.md b/data/README.md
index 638f0d1..27c381a 100644
--- a/data/README.md
+++ b/data/README.md
@@ -44,6 +44,7 @@
 | rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
 | plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
 | rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |

 TODO: Document what each file is in the table above.
@@ -202,3 +203,21 @@ metadata.row_group(0).column(0)
 # total_compressed_size: 84
 # total_uncompressed_size: 80
 ```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({ "arr": arr })
+
+pq.write_table(tab, "test.parquet", compression='BROTLI')
+```
+
+It is meant to exercise reading of structured data where each value
+is smaller than 2GB but the combined uncompressed column chunk size
+is greater than 2GB.
diff --git a/data/large_string_map.brotli.parquet b/data/large_string_map.brotli.parquet
new file mode 100644
index 0000000..fc5c8b2
Binary files /dev/null and b/data/large_string_map.brotli.parquet differ