bkurtz opened a new pull request, #49853:
URL: https://github.com/apache/arrow/pull/49853

   Fixes #43574 (I think; it fixes the aspect of it that I ran into, but I 
haven't checked the OP's issue).
   
   Reverts a small portion of bd444106af494b3d4c6cce0af88f6ce2a6a327eb
   
   ### Rationale for this change
   This reverts a change made in pyarrow 17 that causes reading a single file 
to return different results when the file's path contains `x=y` segments (i.e. 
segments that look like hive partition columns) than when it doesn't.  Because 
some higher-level calls wrap this functionality, e.g. by opening a file before 
passing it to `ParquetDataset`, the same code can give different results on a 
local vs. remote filesystem.  For example, for single-file local reads, 
`pandas.read_parquet` opens a file handle and passes it to pyarrow, while for 
remote reads it passes a single-file path + filesystem.  Code tested on a local 
filesystem can therefore behave differently on the deployed cloud filesystem.
   
   The original change was introduced in 
https://github.com/apache/arrow/pull/39438 and there was a [discussion thread 
about it](https://github.com/apache/arrow/pull/39438#discussion_r1469251517) 
(sorry; GitHub's links to resolved discussions don't always work well!).  The 
gist of the discussion thread seems to be that the PR author thought that this 
code was unused, when in fact the subsequent issue shows that it _is_ used.
   
   <img width="699" height="517" alt="image" 
src="https://github.com/user-attachments/assets/a01618cc-c39d-48fb-9cb8-bd2c1b0c604f"
 />
   
   ### What changes are included in this PR?
   Restores the special "single file" handling for single-file paths passed to 
the `ParquetDataset` constructor, analogous to the handling for an open file 
handle.
   
   As a result, the loaded dataset does _not_ parse the full file path for hive 
partition columns, so no columns are inferred from the path and the returned 
set of columns differs from the current behavior.
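
   To illustrate why parsing the full path changes the column set, here is a 
rough pure-Python sketch.  It is not pyarrow's implementation: `hive_segments` 
and `columns_for` are hypothetical helpers, and the example path is made up.  
It only models the idea that `key=value` directory names become extra columns 
unless the file is treated as a single file.

```python
from pathlib import PurePosixPath


def hive_segments(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a path (hypothetical helper)."""
    columns = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            columns[key] = value
    return columns


def columns_for(path: str, file_columns: list[str], single_file: bool) -> list[str]:
    """Return the column names a reader would report under each behavior."""
    if single_file:
        # Single-file handling: ignore the directory names entirely.
        return list(file_columns)
    # Dataset-style handling: append columns inferred from the path.
    extra = [k for k in hive_segments(path) if k not in file_columns]
    return list(file_columns) + extra


path = "bucket/year=2024/month=05/data.parquet"
print(columns_for(path, ["a", "b"], single_file=True))   # ['a', 'b']
print(columns_for(path, ["a", "b"], single_file=False))  # ['a', 'b', 'year', 'month']
```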
   
   ### Are these changes tested?
   Added a new unit test.  Verified that it fixes the issue I'd been observing 
and had commented on in #43574, though I don't have a working reproduction to 
verify that it fixes the original issue there.
   
   ### Are there any user-facing changes?
   
   **This PR includes breaking changes to public APIs.**  In particular, it 
changes the columns returned by single-file calls to 
`pyarrow.parquet.read_table(...)`, bringing the results back in line with 
pyarrow<17.
   
   While this is technically a breaking change, it should be noted that the 
original PR that introduced this behavior in pyarrow 17 did not call it out as 
a breaking change.  However, it's been some time since then, and it's plausible 
that some applications have developed dependencies on the current behavior.

