This is an automated email from the ASF dual-hosted git repository.

AlenkaF pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 23cd1ff8f4 GH-43574: [Python][Parquet] do not add partition columns from file path when reading single file (#49853)
23cd1ff8f4 is described below

commit 23cd1ff8f4e33b3207875e3395d2d6b1aeb1edc2
Author: bkurtz <[email protected]>
AuthorDate: Wed May 6 02:03:24 2026 -0700

    GH-43574: [Python][Parquet] do not add partition columns from file path when reading single file (#49853)
    
    Fixes #43574 by reverting a small portion of bd444106af494b3d4c6cce0af88f6ce2a6a327eb
    
    ### Rationale for this change
    This reverts a change made in pyarrow 17 that caused reading a single file to return different results when the file happens to be located in a path containing `x=y` segments (i.e. segments that look like hive partition columns) than when it is not. Particularly given the way some higher-level calls wrap this functionality, e.g. by opening a file before passing it to `ParquetDataset`, this can lead to confusing results, e.g. that are different when running code on a local vs [...]
    
    The original change was introduced in https://github.com/apache/arrow/pull/39438, and there was a [discussion thread about it](https://github.com/apache/arrow/pull/39438#discussion_r1469251517) (sorry; GitHub's links to resolved discussions don't always work well!). The gist of the discussion thread seems to be that the PR author thought that this code was unused, when in fact the subsequent issue shows that it _is_ used.
    
    Screenshot of the original discussion thread to help you find it:
    <img width="699" height="517" alt="image" src="https://github.com/user-attachments/assets/a01618cc-c39d-48fb-9cb8-bd2c1b0c604f" />
    
    ### What changes are included in this PR?
    Restores the special "single file" handling for single-file paths passed to the `ParquetDataset` constructor, analogous to the handling for an open file handle.
    
    As a result, the loaded dataset does _not_ parse the full file path for hive partition columns, so the returned set of columns differs from what pyarrow 17 and later returned before this change.
    
    ### Are these changes tested?
    Added a new unit test. Verified that it fixes the issue I'd been observing and had commented on in #43574, though I don't have a working reproduction to verify that it fixes the original issue reported there.
    
    ### Are there any user-facing changes?
    
    **This PR includes breaking changes to public APIs.**  In particular, it 
changes the columns returned by single-file calls to 
`pyarrow.parquet.read_table(...)`, bringing the results back in line with 
pyarrow<17.
    
    While technically a breaking change, it should be noted that the original 
PR that introduced this change in pyarrow 17 did not call out this change as a 
breaking change.  However, it's been some time since then, and it's plausible 
that some applications have developed dependencies on the current behavior.
    * GitHub Issue: #43574
    
    Authored-by: Ben Kurtz <[email protected]>
    Signed-off-by: AlenkaF <[email protected]>
---
 python/pyarrow/parquet/core.py               |  2 ++
 python/pyarrow/tests/parquet/test_dataset.py | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/python/pyarrow/parquet/core.py b/python/pyarrow/parquet/core.py
index 19d8250d51..5234976a92 100644
--- a/python/pyarrow/parquet/core.py
+++ b/python/pyarrow/parquet/core.py
@@ -1446,6 +1446,8 @@ Examples
                     path_or_paths, filesystem, memory_map=memory_map
                 )
                 finfo = filesystem.get_file_info(path_or_paths)
+                if finfo.is_file:
+                    single_file = path_or_paths
                 if finfo.type == FileType.Directory:
                     self._base_dir = path_or_paths
             else:
diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index d3e9cda730..5f04a48893 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -1250,6 +1250,27 @@ def test_parquet_dataset_new_filesystem(tempdir):
     assert result.equals(table)
 
 
+def test_parquet_dataset_partitions_not_loaded_for_single_file(tempdir):
+    # Ensure single-file reads do not include partitions from higher levels of the path
+    table = pa.table({'a': [1, 2, 3]})
+    path = tempdir / 'p=a' / 'data.parquet'
+    path.parent.mkdir()
+    pq.write_table(table, path)
+    # read using a path object
+    dataset = pq.ParquetDataset(path)
+    path_schema = dataset.schema
+    result = dataset.read()
+    assert result.equals(table)
+    # read using a file object; expect same result
+    with path.open("rb") as file:
+        dataset = pq.ParquetDataset(file)
+        file_schema = dataset.schema
+        result = dataset.read()
+    assert result.equals(table)
+    # schemas should match
+    assert path_schema.equals(file_schema)
+
+
 def test_parquet_dataset_partitions_piece_path_with_fsspec(tempdir):
     # ARROW-10462 ensure that on Windows we properly use posix-style paths
     # as used by fsspec
