This is an automated email from the ASF dual-hosted git repository.
AlenkaF pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 23cd1ff8f4 GH-43574: [Python][Parquet] do not add partition columns
from file path when reading single file (#49853)
23cd1ff8f4 is described below
commit 23cd1ff8f4e33b3207875e3395d2d6b1aeb1edc2
Author: bkurtz <[email protected]>
AuthorDate: Wed May 6 02:03:24 2026 -0700
GH-43574: [Python][Parquet] do not add partition columns from file path
when reading single file (#49853)
Fixes #43574 by reverting a small portion of
bd444106af494b3d4c6cce0af88f6ce2a6a327eb
### Rationale for this change
This reverts a change made in pyarrow 17 that causes reading a single file
to return different results when that file happens to be located in a path
that contains `x=y` segments (i.e. segments that look like hive partition
columns) than when it doesn't. Given the way some higher-level calls wrap
this functionality, e.g. by opening a file before it is passed to
`ParquetDataset`, this can lead to confusing results, e.g. results that differ
when running code on a local vs [...]
The original change was introduced in
https://github.com/apache/arrow/pull/39438 and there was a [discussion thread
about it](https://github.com/apache/arrow/pull/39438#discussion_r1469251517)
(sorry; github's links to resolved discussions don't always work well!) The
gist of the discussion thread seems to be that the PR author thought that this
code was unused, when in fact the subsequent issue shows that it _is_ used.
Screenshot of the original discussion thread to help you find it:
<img width="699" height="517" alt="image"
src="https://github.com/user-attachments/assets/a01618cc-c39d-48fb-9cb8-bd2c1b0c604f"
/>
### What changes are included in this PR?
Restores the special "single file" handling for single-file paths passed to
the `ParquetDataset` constructor, analogous to the existing handling for an
open file handle.
As a result, the loaded dataset does _not_ parse the full file path for
hive partition columns, which changes the set of columns returned: no columns
are added from `x=y` path segments.
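To illustrate what "parsing the full file path for hive partition columns" means, here is a minimal sketch in plain Python. `hive_segments` is a hypothetical helper written for this note, not part of pyarrow, and the example paths are made up; it only approximates hive-style discovery and is not how Arrow implements it.

```python
from pathlib import PurePosixPath


def hive_segments(path: str) -> dict[str, str]:
    """Collect `key=value` directory segments from a path (illustration only)."""
    # Look at directory components only; the file name itself is skipped.
    parts = PurePosixPath(path).parts[:-1]
    found = {}
    for part in parts:
        key, sep, value = part.partition("=")
        if sep and key:
            found[key] = value
    return found


# Hypothetical example paths:
print(hive_segments("/data/p=a/data.parquet"))    # {'p': 'a'}
print(hive_segments("/data/plain/data.parquet"))  # {}
```

For a file at a path like the first one, the behavior being reverted could add a `p` column derived from the directory name. With this PR, a single-file read returns only the columns stored in the file, matching what is returned when an open file handle is passed.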
### Are these changes tested?
Added a new unit test. Verified that it fixes the issue I'd been observing
and had commented on in #43574, though I don't have a working reproduction
to verify that it fixes the original issue there.
### Are there any user-facing changes?
**This PR includes breaking changes to public APIs.** In particular, it
changes the columns returned by single-file calls to
`pyarrow.parquet.read_table(...)`, bringing the results back in line with
pyarrow<17.
While this is technically a breaking change, the original PR that introduced
the behavior in pyarrow 17 did not call it out as one. However, it's been
some time since then, and it's plausible that some applications have come to
depend on the current behavior.
* GitHub Issue: #43574
Authored-by: Ben Kurtz <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
---
python/pyarrow/parquet/core.py | 2 ++
python/pyarrow/tests/parquet/test_dataset.py | 21 +++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 19d8250d51..5234976a92 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -1446,6 +1446,8 @@ Examples
path_or_paths, filesystem, memory_map=memory_map
)
finfo = filesystem.get_file_info(path_or_paths)
+ if finfo.is_file:
+ single_file = path_or_paths
if finfo.type == FileType.Directory:
self._base_dir = path_or_paths
else:
diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index d3e9cda730..5f04a48893 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -1250,6 +1250,27 @@ def test_parquet_dataset_new_filesystem(tempdir):
assert result.equals(table)
+def test_parquet_dataset_partitions_not_loaded_for_single_file(tempdir):
+    # Ensure single-file reads do not include partitions from higher levels of the path
+    table = pa.table({'a': [1, 2, 3]})
+    path = tempdir / 'p=a' / 'data.parquet'
+    path.parent.mkdir()
+    pq.write_table(table, path)
+    # read using a path object
+    dataset = pq.ParquetDataset(path)
+    path_schema = dataset.schema
+    result = dataset.read()
+    assert result.equals(table)
+    # read using a file object; expect same result
+    with path.open("rb") as file:
+        dataset = pq.ParquetDataset(file)
+        file_schema = dataset.schema
+        result = dataset.read()
+        assert result.equals(table)
+    # schemas should match
+    assert path_schema.equals(file_schema)
+
+
def test_parquet_dataset_partitions_piece_path_with_fsspec(tempdir):
# ARROW-10462 ensure that on Windows we properly use posix-style paths
# as used by fsspec