jorisvandenbossche commented on a change in pull request #11123:
URL: https://github.com/apache/arrow/pull/11123#discussion_r707995055



##########
File path: python/pyarrow/tests/parquet/test_metadata.py
##########
@@ -496,3 +496,29 @@ def test_parquet_metadata_empty_to_dict(tempdir):
     assert len(metadata_dict["row_groups"]) == 1
     assert len(metadata_dict["row_groups"][0]["columns"]) == 1
     assert metadata_dict["row_groups"][0]["columns"][0]["statistics"] is None
+
+
[email protected]
[email protected]_memory
+def test_metadata_exceeds_message_size():
+    # ARROW-13655: Thrift may set a default message size that limits
+    # the size of Parquet metadata that can be written.
+    NCOLS = 1000
+    NREPEATS = 4000
+
+    table = pa.table({str(i): np.random.randn(10) for i in range(NCOLS)})
+
+    with pa.BufferOutputStream() as out:
+        pq.write_table(table, out)
+        buf = out.getvalue()
+
+    original_metadata = pq.read_metadata(pa.BufferReader(buf))
+    metadata = pq.read_metadata(pa.BufferReader(buf))
+    for i in range(NREPEATS):
+        metadata.append_row_groups(original_metadata)

Review comment:
       BTW, the reason the reproducer reads the metadata in twice and then 
appends one copy to the other many times is that appending a metadata 
object to itself is buggy -> https://issues.apache.org/jira/browse/ARROW-13654
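
For context, here is a minimal sketch (not part of the PR) contrasting the self-append pattern tracked in ARROW-13654 with the two-object workaround used in the reproducer above; the tiny one-column table and the loop count of 3 are placeholders chosen only for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table so there is some Parquet metadata to work with.
table = pa.table({"x": list(range(10))})
with pa.BufferOutputStream() as out:
    pq.write_table(table, out)
    buf = out.getvalue()

# Problematic pattern (ARROW-13654): appending a FileMetaData object to itself.
md = pq.read_metadata(pa.BufferReader(buf))
# md.append_row_groups(md)  # avoid: self-append is the buggy case

# Workaround used in the test above: read the metadata twice so the source
# and destination are distinct objects, then append repeatedly.
source = pq.read_metadata(pa.BufferReader(buf))
target = pq.read_metadata(pa.BufferReader(buf))
for _ in range(3):
    target.append_row_groups(source)

# target started with 1 row group and gained source's row groups 3 times.
assert target.num_row_groups == 1 + 3 * source.num_row_groups
```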



