jorisvandenbossche commented on a change in pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#discussion_r665539346
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
import pyarrow.parquet as pq
metadata_collector = []
- for i in range(4):
- table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
- pq.write_to_dataset(
- table, str(root_path), metadata_collector=metadata_collector
- )
+ f1_vals = [item for chunk in range(4) for item in [chunk] * 10]
+ f2_vals = [item*10 for chunk in range(4) for item in [chunk] * 10]
+
+ table = pa.table({'f1': f1_vals, 'f2': f2_vals})
+ pq.write_to_dataset(
+ table, str(root_path), partition_cols=['f1'],
+ use_legacy_dataset=use_legacy_dataset,
+ metadata_collector=metadata_collector
+ )
Review comment:
Actually, written this way it is of course not using the non-legacy
`write_to_dataset` code path, so if we are not exercising that, we could also
keep the original version without passing the `use_legacy_dataset` keyword.
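For illustration, a minimal sketch of the distinction (not the PR's code; the output directory names are made up, and at the time of this PR the default for `write_to_dataset` is still the legacy writer):
```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'f1': [0] * 10 + [1] * 10,
                  'f2': [float(v) for v in range(20)]})

# Explicitly opting in to the non-legacy (datasets-based) writer.
pq.write_to_dataset(table, "dataset_new", partition_cols=['f1'],
                    use_legacy_dataset=False)

# Without the keyword the default (legacy) writer is used, which is what
# "keeping the original" would amount to here.
pq.write_to_dataset(table, "dataset_legacy", partition_cols=['f1'])
```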
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
import pyarrow.parquet as pq
metadata_collector = []
- for i in range(4):
- table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
- pq.write_to_dataset(
- table, str(root_path), metadata_collector=metadata_collector
- )
+ f1_vals = [item for chunk in range(4) for item in [chunk] * 10]
+ f2_vals = [item*10 for chunk in range(4) for item in [chunk] * 10]
+
+ table = pa.table({'f1': f1_vals, 'f2': f2_vals})
+ pq.write_to_dataset(
+ table, str(root_path), partition_cols=['f1'],
+ use_legacy_dataset=use_legacy_dataset,
+ metadata_collector=metadata_collector
+ )
Review comment:
```suggestion
root_path.mkdir()
for i in range(4):
table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
path = root_path / f"part-{i}.parquet"
pq.write_table(table, str(path),
metadata_collector=metadata_collector)
# set the file path relative to the root of the partitioned dataset
metadata_collector[-1].set_file_path(f"part-{i}.parquet")
```
Looking a bit more in detail, the goal of this helper method is actually to
create a non-partitioned dataset (just a directory with flat files), since
the partitioned case is covered a bit further below by the helper function
`_create_parquet_dataset_partitioned`. So my code suggestion here adapts it to
use plain `write_table`, which also overcomes the issue that `write_to_dataset`
generates identically named files (and thus overwrites them).
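For context, here is a hedged, self-contained sketch of the flat-files-plus-`_metadata` layout this helper is meant to produce and how the tests consume it (the directory name and the `read_metadata` variant are my own assumptions; the suggestion above collects the metadata via `write_table` directly):
```python
import pathlib

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root_path = pathlib.Path("flat_dataset_example")  # illustrative directory
root_path.mkdir()

metadata_collector = []
for i in range(4):
    table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
    name = f"part-{i}.parquet"
    pq.write_table(table, str(root_path / name))
    # collect the per-file metadata and set its path relative to the root
    md = pq.read_metadata(str(root_path / name))
    md.set_file_path(name)
    metadata_collector.append(md)

# consolidate the per-file metadata into a single _metadata file
metadata_path = str(root_path / "_metadata")
pq.write_metadata(table.schema, metadata_path,
                  metadata_collector=metadata_collector)

# the _metadata-based dataset factory can then reconstruct the dataset
# without listing or opening the individual part files
dataset = ds.parquet_dataset(metadata_path)
assert dataset.to_table().num_rows == 40  # 4 files x 10 rows
```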
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,56 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset):
import pyarrow.parquet as pq
metadata_collector = []
- for i in range(4):
- table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
- pq.write_to_dataset(
- table, str(root_path), metadata_collector=metadata_collector
- )
+ f1_vals = [item for chunk in range(4) for item in [chunk] * 10]
+
+ table = pa.table({'f1': f1_vals, 'f2': np.random.randn(40)})
+ pq.write_to_dataset(
+ table, str(root_path), partition_cols=['f1'],
+ use_legacy_dataset=use_legacy_dataset,
+ metadata_collector=metadata_collector
+ )
+
+ partitionless_schema = pa.schema([pa.field('f2', pa.float64())])
metadata_path = str(root_path / '_metadata')
# write _metadata file
pq.write_metadata(
- table.schema, metadata_path,
+ partitionless_schema, metadata_path,
Review comment:
With my suggestion above, those changes can all be reverted, I think (they
were changing the intent of the tests).
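Concretely, reverting would restore the `_metadata` write against the full table schema, roughly as in the pre-change lines of the diff (`table`, `root_path`, and `metadata_collector` being the helper's locals):
```python
metadata_path = str(root_path / '_metadata')
# write _metadata file against the full table schema, as before
pq.write_metadata(
    table.schema, metadata_path,
    metadata_collector=metadata_collector)
```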
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
Review comment:
```suggestion
def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
"""
Creates a simple (flat files, no nested partitioning) Parquet dataset
"""
```
(to clarify the intent)
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2672,47 +2672,57 @@ def test_feather_format(tempdir, dataset_reader):
dataset_reader.to_table(ds.dataset(basedir, format="feather"))
-def _create_parquet_dataset_simple(root_path):
+def _create_parquet_dataset_simple(root_path, use_legacy_dataset=True):
import pyarrow.parquet as pq
metadata_collector = []
- for i in range(4):
- table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
- pq.write_to_dataset(
- table, str(root_path), metadata_collector=metadata_collector
- )
+ f1_vals = [item for chunk in range(4) for item in [chunk] * 10]
+ f2_vals = [item*10 for chunk in range(4) for item in [chunk] * 10]
+
+ table = pa.table({'f1': f1_vals, 'f2': f2_vals})
+ pq.write_to_dataset(
+ table, str(root_path), partition_cols=['f1'],
+ use_legacy_dataset=use_legacy_dataset,
+ metadata_collector=metadata_collector
+ )
Review comment:
There is, however, a `_create_parquet_dataset_partitioned` helper a little
further below, and I think it is that one that should be parametrized with
`use_legacy_dataset=True/False`.
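A hedged sketch of what that parametrization could look like (the test name, fixtures, and the helper's return values are assumptions on my side, not the PR's actual code):
```python
import pytest
import pyarrow.dataset as ds


@pytest.mark.parametrize("use_legacy_dataset", [True, False])
def test_parquet_dataset_factory_partitioned(tempdir, use_legacy_dataset):
    # Assumes _create_parquet_dataset_partitioned writes a hive-partitioned
    # dataset plus a _metadata file and returns (metadata_path, table);
    # the real helper in test_dataset.py may differ.
    root_path = tempdir / "test_parquet_dataset_factory_partitioned"
    metadata_path, table = _create_parquet_dataset_partitioned(
        root_path, use_legacy_dataset=use_legacy_dataset)

    dataset = ds.parquet_dataset(metadata_path, partitioning="hive")
    assert dataset.to_table().num_rows == table.num_rows
```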