bkurtz opened a new pull request, #49853: URL: https://github.com/apache/arrow/pull/49853
Fixes #43574 (I think: it fixes the aspect of it that I ran into, but I haven't checked the OP's issue). Reverts a small portion of bd444106af494b3d4c6cce0af88f6ce2a6a327eb.

### Rationale for this change

This reverts a change made in pyarrow 17 that makes reading a single file return different results when that file happens to be located in a path that contains `x=y` segments (i.e. segments that look like hive partition columns) than when it is located in a path that doesn't.

Given the way some higher-level calls wrap this functionality, e.g. by opening a file before it is passed to `ParquetDataset`, this can lead to confusing results, such as code that behaves differently on a local filesystem than on a remote one. For example, for single-file local reads, `pandas.read_parquet` already opens a filehandle to pass to pyarrow, while for remote reads it passes a single-file path + filesystem. The same code therefore returns different columns when tested on a local filesystem than when deployed against a cloud filesystem.

The original change was introduced in https://github.com/apache/arrow/pull/39438 and there was a [discussion thread about it](https://github.com/apache/arrow/pull/39438#discussion_r1469251517) (sorry; github's links to resolved discussions don't always work well!). The gist of the discussion thread seems to be that the PR author thought this code was unused, when in fact the subsequent issue shows that it _is_ used.

<img width="699" height="517" alt="image" src="https://github.com/user-attachments/assets/a01618cc-c39d-48fb-9cb8-bd2c1b0c604f" />

### What changes are included in this PR?

Restores the special "single file" handling for single-file paths passed to the `ParquetDataset` constructor, analogous to the handling for an open file handle. As a result, the loaded dataset does _not_ parse the full file path for hive partition columns, which changes the set of columns returned.

### Are these changes tested?

Added a new unit test.
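
To illustrate the behavior at stake, here is a minimal standalone sketch of how `x=y` path segments can turn into extra columns, and how single-file handling avoids that. It does not use pyarrow and is not the unit test added in this PR: `hive_segments` and `columns_for_read` are hypothetical helpers, the example path is made up, and appending partition columns after the file's own columns is a simplifying assumption about how discovery behaves.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def hive_segments(path: str) -> dict[str, str]:
    """Collect `key=value` directory segments from a path (hypothetical helper)."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    columns: dict[str, str] = {}
    for part in directories:
        key, sep, value = part.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns


def columns_for_read(path: str, file_columns: list[str], single_file: bool) -> list[str]:
    """Columns a reader would report, with or without hive parsing of the path."""
    if single_file:
        # Single-file handling: only the columns stored in the file are returned,
        # and the path is never inspected.
        return list(file_columns)
    # Path-based discovery (assumed model): hive-style segments become extra columns.
    extra = [key for key in hive_segments(path) if key not in file_columns]
    return list(file_columns) + extra


# Made-up path whose directories look like hive partitions.
path = "bucket/year=2024/month=01/data.parquet"
print(columns_for_read(path, ["a", "b"], single_file=True))
print(columns_for_read(path, ["a", "b"], single_file=False))
```

In this model, the same file yields two columns with single-file handling and four when the path is parsed, which is the kind of mismatch described in the rationale.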
Verified that it fixes the issue I'd been observing, which I'd commented on in #43574, though I don't have a working reproduction to verify that it fixes the original issue there.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.** In particular, it changes the columns returned by single-file calls to `pyarrow.parquet.read_table(...)`, bringing the results back in line with pyarrow<17. While this is technically a breaking change, note that the original PR that introduced this behavior in pyarrow 17 did not call it out as a breaking change. However, it's been some time since then, and it's plausible that some applications have come to depend on the current behavior.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
