lesterfan opened a new pull request, #47302: URL: https://github.com/apache/arrow/pull/47302
### Rationale for this change

This PR resolves the issue reported in https://github.com/apache/arrow/issues/47301.

A `CFileSource` can be created from [three possible file source types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L104):

1. A `pa.Buffer`.
2. A `path` string.
3. A file-like object that has a `read` attribute.

However, `FileFragment.open()` currently [explicitly handles only the first two types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L2005). When `open()` is called on a `FileFragment` created from type (3), the current implementation tries to read the `path`, which is set to the string `"<Buffer>"` ([source](https://github.com/apache/arrow/blob/135357ce3824d1a8e1aba5a19d897b0c02b22ab7/cpp/src/arrow/dataset/file_base.h#L106)). This causes the segmentation fault observed in the linked issue.

### What changes are included in this PR?

1. Modify `FileFragment.open()` to handle all three `CFileSource` cases listed above.
2. Add a unit test that segfaults without the change in (1) and passes with it.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes; this PR fixes a user-facing bug in the `FileFragment` API.
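
For illustration only, the three-way dispatch described in the rationale can be sketched in plain Python. This is not the code in `_dataset.pyx`: `open_source` is a hypothetical helper, and a bytes-like object stands in for `pa.Buffer`. The point is that a file-like source gets its own branch instead of falling through to a path lookup.

```python
import io


def open_source(source):
    """Hypothetical stand-in for the three-way dispatch; not pyarrow code."""
    # Case 1: an in-memory buffer (bytes-like here, standing in for pa.Buffer).
    if isinstance(source, (bytes, bytearray, memoryview)):
        return io.BytesIO(bytes(source))
    # Case 2: a path string, opened from the filesystem.
    if isinstance(source, str):
        return open(source, "rb")
    # Case 3: a file-like object exposing `read`; use it directly rather
    # than treating a placeholder name as a real path.
    if hasattr(source, "read"):
        return source
    raise TypeError(f"unsupported file source: {type(source).__name__}")
```

With this shape, passing an `io.BytesIO` returns the same object, while an unsupported type raises `TypeError` instead of failing later in a less obvious way.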