jorisvandenbossche commented on code in PR #39112:
URL: https://github.com/apache/arrow/pull/39112#discussion_r1431112710
##########
python/pyarrow/tests/parquet/test_compliant_nested_type.py:
##########
@@ -83,21 +79,18 @@ def test_write_compliant_nested_type_enable(tempdir,
assert new_table.schema.types[0].value_field.name == 'element'
# Verify that the new table can be read/written correctly
- _check_roundtrip(new_table,
- use_legacy_dataset=use_legacy_dataset)
+ _check_roundtrip(new_table)
@pytest.mark.pandas
-@parametrize_legacy_dataset
+@pytest.mark.dataset
Review Comment:
Is the dataset mark needed here?
##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -752,17 +682,15 @@ def test_fastparquet_cross_compatibility(tempdir):
tm.assert_frame_equal(table_fp.to_pandas(), df)
-@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
lambda: pa.array([0, None] * 10),
lambda: pa.array([0, None] * 10).dictionary_encode(),
lambda: pa.array(["", None] * 10),
lambda: pa.array(["", None] * 10).dictionary_encode(),
])
-@pytest.mark.parametrize('use_dictionary', [False, True])
Review Comment:
Is there a reason this parametrization was also removed?
##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -1388,16 +1380,13 @@ def _create_dataset_all_types(tempdir, chunk_size=None):
path = str(tempdir / "test_parquet_dataset_all_types")
# write_to_dataset currently requires pandas
Review Comment:
I think this comment is no longer correct? (Which might also mean that all
tests that use this helper no longer need the pandas mark.)
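For what it's worth, a minimal sketch of the pandas-free path (the table and
root path here are made up for illustration):
```python
import pyarrow as pa
import pyarrow.parquet as pq

# A plain Arrow table; pandas is never imported.
table = pa.table({
    "part": ["a", "a", "b"],
    "value": [1, 2, 3],
})

# write_to_dataset goes through the dataset API nowadays, so this should
# work in an environment without pandas installed.
pq.write_to_dataset(
    table,
    root_path="/tmp/all_types_dataset",  # hypothetical path
    partition_cols=["part"],
)
```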
##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -48,8 +47,8 @@
@pytest.mark.pandas
-@parametrize_legacy_dataset
-def test_pandas_parquet_datetime_tz(use_legacy_dataset):
+@pytest.mark.dataset
Review Comment:
Same here and in the one below, about the dataset marker.
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -986,31 +699,24 @@ def read_multiple_files(paths, columns=None, use_threads=True, **kwargs):
t = pa.Table.from_pandas(bad_apple)
_write_table(t, bad_apple_path)
- if not use_legacy_dataset:
- # TODO(dataset) Dataset API skips bad files
- return
+ # TODO(dataset) Dataset API skips bad files
Review Comment:
Do we know if there is an issue about this?
##########
python/pyarrow/tests/parquet/test_pandas.py:
##########
@@ -497,15 +447,14 @@ def test_categorical_index_survives_roundtrip(use_legacy_dataset):
table = pa.Table.from_pandas(df)
bos = pa.BufferOutputStream()
pq.write_table(table, bos)
- ref_df = pq.read_pandas(
- bos.getvalue(), use_legacy_dataset=use_legacy_dataset).to_pandas()
+ ref_df = pq.read_pandas(bos.getvalue()).to_pandas()
assert isinstance(ref_df.index, pd.CategoricalIndex)
assert ref_df.index.equals(df.index)
@pytest.mark.pandas
-@parametrize_legacy_dataset
-def test_categorical_order_survives_roundtrip(use_legacy_dataset):
+@pytest.mark.dataset
Review Comment:
I suppose that many of those tests also don't need the dataset mark
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -938,46 +671,26 @@ def test_read_multiple_files(tempdir, use_legacy_dataset):
(dirpath / '_SUCCESS.crc').touch()
def read_multiple_files(paths, columns=None, use_threads=True, **kwargs):
- dataset = pq.ParquetDataset(
- paths, use_legacy_dataset=use_legacy_dataset, **kwargs)
+ dataset = pq.ParquetDataset(paths, **kwargs)
return dataset.read(columns=columns, use_threads=use_threads)
result = read_multiple_files(paths)
expected = pa.concat_tables(test_data)
assert result.equals(expected)
- # Read with provided metadata
- # TODO(dataset) specifying metadata not yet supported
Review Comment:
We actually do support that through a different dataset API
(`pyarrow.dataset.parquet_dataset`), but it is not wired up in the parquet
API. Given that we never got any complaints about this, I suppose it's fine to
just remove it (and it should already be tested in the dataset tests, so it
should be fine to remove the tests here as well).
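For reference, the metadata-based API mentioned above looks roughly like this
(a sketch with a hypothetical path, not the parquet-module spelling):
```python
import pyarrow.dataset as ds

# Build a dataset directly from an existing "_metadata" sidecar file,
# instead of discovering and reading the footer of every data file.
dataset = ds.parquet_dataset("/tmp/dataset_root/_metadata")  # hypothetical path
table = dataset.to_table()
```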
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1123,18 +819,14 @@ def _make_example_multifile_dataset(base_path, nfiles=10, file_nrows=5):
return paths
-def _assert_dataset_paths(dataset, paths, use_legacy_dataset):
- if use_legacy_dataset:
- assert set(map(str, paths)) == {x.path for x in dataset._pieces}
- else:
- paths = [str(path.as_posix()) for path in paths]
- assert set(paths) == set(dataset._dataset.files)
+def _assert_dataset_paths(dataset, paths):
+ paths = [str(path.as_posix()) for path in paths]
+ assert set(paths) == set(dataset._dataset.files)
Review Comment:
```suggestion
assert set(paths) == set(dataset.files)
```
(this is exposed nowadays, so we can use the public attribute here in the
tests as well)
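For illustration, the public attribute in isolation (hypothetical path, just a
sketch):
```python
import pyarrow.parquet as pq

# ParquetDataset.files lists the discovered data file paths, so tests
# no longer need to reach into the private _dataset wrapper.
dataset = pq.ParquetDataset("/tmp/example_dataset")  # hypothetical path
print(dataset.files)  # e.g. ['/tmp/example_dataset/part-0.parquet', ...]
```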
##########
python/pyarrow/tests/parquet/test_pandas.py:
##########
@@ -150,52 +144,22 @@ def test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written(
tm.assert_frame_equal(df, df_read)
-# TODO(dataset) duplicate column selection actually gives duplicate columns now
Review Comment:
Do you know if this is actually tested elsewhere? (that you get duplicate
columns)
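If I read the removed TODO correctly, the behavior in question would be
something like this (untested sketch, hypothetical path):
```python
import pyarrow as pa
import pyarrow.parquet as pq

path = "/tmp/dup_columns.parquet"  # hypothetical path
pq.write_table(pa.table({"a": [1, 2], "b": [3, 4]}), path)

# Select the same column twice: per the removed TODO, the dataset-based
# reader now returns the column twice instead of deduplicating it.
result = pq.read_table(path, columns=["a", "a"])
print(result.column_names)  # expected: ['a', 'a']
```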
##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -752,17 +682,15 @@ def test_fastparquet_cross_compatibility(tempdir):
tm.assert_frame_equal(table_fp.to_pandas(), df)
-@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
lambda: pa.array([0, None] * 10),
lambda: pa.array([0, None] * 10).dictionary_encode(),
lambda: pa.array(["", None] * 10),
lambda: pa.array(["", None] * 10).dictionary_encode(),
])
-@pytest.mark.parametrize('use_dictionary', [False, True])
Review Comment:
Ah, because it was just not used in practice
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1230,29 +915,22 @@ def test_ignore_custom_prefixes(tempdir, use_legacy_dataset):
partition_cols=['_part'])
read = pq.read_table(
- tempdir, use_legacy_dataset=use_legacy_dataset,
- ignore_prefixes=['_private'])
+ tempdir, ignore_prefixes=['_private'])
assert read.equals(table)
-@parametrize_legacy_dataset_fixed
-def test_empty_directory(tempdir, use_legacy_dataset):
- # ARROW-5310 - reading empty directory
- # fails with legacy implementation
+def test_empty_directory(tempdir):
Review Comment:
```suggestion
def test_empty_directory(tempdir):
# ARROW-5310 - reading empty directory
```
Maybe keep the issue reference (that's still relevant for the dataset case,
as that was the reason this test was introduced)