westonpace commented on a change in pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#discussion_r666602953
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
import pyarrow.parquet as pq
metadata_collector = []
- for i in range(4):
- table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
- pq.write_to_dataset(
- table, str(root_path), metadata_collector=metadata_collector
- )
+ f1_vals = [item for chunk in range(4) for item in [chunk] * 10]
+ f2_vals = [item*10 for chunk in range(4) for item in [chunk] * 10]
+
+ table = pa.table({'f1': f1_vals, 'f2': f2_vals})
+ pq.write_to_dataset(
+ table, str(root_path), partition_cols=['f1'],
+ use_legacy_dataset=use_legacy_dataset,
+ metadata_collector=metadata_collector
+ )
Review comment:
I also ended up spinning myself around a few times here. What I wanted
is a test to ensure that we can round trip table -> _metadata -> factory ->
dataset -> table using the metadata_collector. However, changing
use_legacy_dataset here fails (since the new dataset doesn't support append,
a.k.a. ARROW-12358). And both use_legacy_dataset versions fail when there is a
partition because of ARROW-13269.
So I created a new, simple test which checks the round trip with a single
file, no append, and no partitioning. Tests for the other scenarios can be
fleshed out when those issues are addressed. I then restored all the old tests
as they were. Hopefully this works.
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
Review comment:
Added.