This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 20d5c310b1 GH-23870: [Python] Ensure parquet.write_to_dataset doesn't create empty files for non-observed dictionary (category) values (#36465)
20d5c310b1 is described below

commit 20d5c310b119b284ce4eeb9d46e2abf59e7017a1
Author: Joris Van den Bossche <[email protected]>
AuthorDate: Wed Jul 5 11:09:59 2023 +0200

    GH-23870: [Python] Ensure parquet.write_to_dataset doesn't create empty files for non-observed dictionary (category) values (#36465)
    
    ### What changes are included in this PR?
    
    If we partition on a categorical variable with "unobserved" categories (values present in the dictionary but not in the actual data), the legacy path in `pq.write_to_dataset` currently creates an empty file for each unobserved value. The new dataset-based path already has the preferred behavior (no empty files); this PR fixes the legacy path accordingly and adds a test covering both paths.
    
    This also fixes one of the pandas deprecation warnings listed in 
https://github.com/apache/arrow/issues/36412
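    For context (a minimal sketch, not part of the patch): pandas' `groupby` on a categorical column emits one group per *category* in the dtype unless `observed=True` is passed, so unobserved categories yield empty sub-DataFrames -- which is where the empty files came from:

```python
import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    "col": [1, 2, 3],
})

# observed=False (the old legacy-path behavior): one group per category,
# so the unobserved "c" produces an empty sub-DataFrame.
keys_all = [key for key, _ in df.groupby("cat", observed=False)]

# observed=True (this fix): only categories actually present in the data.
keys_observed = [key for key, _ in df.groupby("cat", observed=True)]

print(keys_all)       # ['a', 'b', 'c']
print(keys_observed)  # ['a', 'b']
```

Passing `observed=False`/`observed=True` explicitly also avoids the pandas FutureWarning about the changing default for categorical groupers.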
    
    ### Are these changes tested?
    
    Yes
    
    ### Are there any user-facing changes?
    
    Yes: the legacy path no longer creates a hive-style directory containing a single empty file (a Parquet file with 0 rows) for each unobserved category. This aligns the legacy path with the new, default dataset-based path.
    * Closes: #23870
    
    Authored-by: Joris Van den Bossche <[email protected]>
    Signed-off-by: Joris Van den Bossche <[email protected]>
---
 python/pyarrow/parquet/core.py               |  2 +-
 python/pyarrow/tests/parquet/test_dataset.py | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 3fa0d6dadd..0675a2c9cc 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -3468,7 +3468,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
         if len(partition_keys) == 1:
             partition_keys = partition_keys[0]
 
-        for keys, subgroup in data_df.groupby(partition_keys):
+        for keys, subgroup in data_df.groupby(partition_keys, observed=True):
             if not isinstance(keys, tuple):
                 keys = (keys,)
             subdir = '/'.join(
diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index d8b97afeb6..c9a0c63eb1 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -1932,3 +1932,24 @@ def test_write_to_dataset_kwargs_passed(tempdir, write_dataset_kwarg):
         pq.write_to_dataset(table, path, **{key: arg})
         _name, _args, kwargs = mock_write_dataset.mock_calls[0]
         assert kwargs[key] == arg
+
+
[email protected]
+@parametrize_legacy_dataset
+def test_write_to_dataset_category_observed(tempdir, use_legacy_dataset):
+    # if we partition on a categorical variable with "unobserved" categories
+    # (values present in the dictionary, but not in the actual data)
+    # ensure those are not creating empty files/directories
+    df = pd.DataFrame({
+        "cat": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
+        "col": [1, 2, 3]
+    })
+    table = pa.table(df)
+    path = tempdir / "dataset"
+    pq.write_to_dataset(
+        table, tempdir / "dataset", partition_cols=["cat"],
+        use_legacy_dataset=use_legacy_dataset
+    )
+    subdirs = [f.name for f in path.iterdir() if f.is_dir()]
+    assert len(subdirs) == 2
+    assert "cat=c" not in subdirs
