This is an automated email from the ASF dual-hosted git repository.
jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 20d5c310b1 GH-23870: [Python] Ensure parquet.write_to_dataset doesn't
create empty files for non-observed dictionary (category) values (#36465)
20d5c310b1 is described below
commit 20d5c310b119b284ce4eeb9d46e2abf59e7017a1
Author: Joris Van den Bossche <[email protected]>
AuthorDate: Wed Jul 5 11:09:59 2023 +0200
GH-23870: [Python] Ensure parquet.write_to_dataset doesn't create empty
files for non-observed dictionary (category) values (#36465)
### What changes are included in this PR?
If we partition on a categorical variable with "unobserved" categories
(values present in the dictionary, but not in the actual data), the legacy path
in `pq.write_to_dataset` currently creates empty files for those categories.
The new dataset-based path already has the preferred behavior; this PR fixes
the legacy path to match and adds a test covering both paths.
This also fixes one of the pandas deprecation warnings listed in
https://github.com/apache/arrow/issues/36412
### Are these changes tested?
Yes
### Are there any user-facing changes?
Yes: the legacy path no longer creates a hive-style directory containing a
single empty file (a Parquet file with 0 rows) for each unobserved category.
This simply aligns the legacy path with the new, default dataset-based path.
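
For illustration (this snippet is not part of the patch), the behavior
difference comes from pandas' `groupby` on a categorical column: with
`observed=False` it emits a group for every category in the dictionary,
including empty ones, while `observed=True` restricts iteration to categories
actually present in the data:

```python
import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    "col": [1, 2, 3],
})

# observed=False: one group per dictionary category, even if empty ("c")
default_groups = [key for key, _ in df.groupby("cat", observed=False)]

# observed=True: only categories that actually occur in the data
observed_groups = [key for key, _ in df.groupby("cat", observed=True)]

print(default_groups)   # ['a', 'b', 'c']
print(observed_groups)  # ['a', 'b']
```

In the legacy write path, each group becomes one partition directory and file,
so the empty `"c"` group under the old default produced the empty files this
PR removes.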
* Closes: #23870
Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
---
python/pyarrow/parquet/core.py | 2 +-
python/pyarrow/tests/parquet/test_dataset.py | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 3fa0d6dadd..0675a2c9cc 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -3468,7 +3468,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
if len(partition_keys) == 1:
partition_keys = partition_keys[0]
- for keys, subgroup in data_df.groupby(partition_keys):
+ for keys, subgroup in data_df.groupby(partition_keys, observed=True):
if not isinstance(keys, tuple):
keys = (keys,)
subdir = '/'.join(
diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index d8b97afeb6..c9a0c63eb1 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -1932,3 +1932,24 @@ def test_write_to_dataset_kwargs_passed(tempdir, write_dataset_kwarg):
pq.write_to_dataset(table, path, **{key: arg})
_name, _args, kwargs = mock_write_dataset.mock_calls[0]
assert kwargs[key] == arg
+
+
[email protected]
+@parametrize_legacy_dataset
+def test_write_to_dataset_category_observed(tempdir, use_legacy_dataset):
+ # if we partition on a categorical variable with "unobserved" categories
+ # (values present in the dictionary, but not in the actual data)
+ # ensure those are not creating empty files/directories
+ df = pd.DataFrame({
+ "cat": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
+ "col": [1, 2, 3]
+ })
+ table = pa.table(df)
+ path = tempdir / "dataset"
+ pq.write_to_dataset(
+ table, tempdir / "dataset", partition_cols=["cat"],
+ use_legacy_dataset=use_legacy_dataset
+ )
+ subdirs = [f.name for f in path.iterdir() if f.is_dir()]
+ assert len(subdirs) == 2
+ assert "cat=c" not in subdirs