lesterfan opened a new pull request, #47302:
URL: https://github.com/apache/arrow/pull/47302

   ### Rationale for this change
   
   This PR resolves the issue reported in 
https://github.com/apache/arrow/issues/47301.
   
   A `CFileSource` can be created from one of [three possible file source 
types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L104):
   
   1. From a `pa.Buffer`.
   2. From a `path` string.
   3. From a file-like object which has a `read` attribute.
   
   However, `FileFragment.open()` currently [explicitly handles only the first 
two 
types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L2005).
 When `open()` is called on a `FileFragment` created from type (3), the current 
implementation tries to read from the `path`, which is set to the placeholder 
string `"<Buffer>"` 
([source](https://github.com/apache/arrow/blob/135357ce3824d1a8e1aba5a19d897b0c02b22ab7/cpp/src/arrow/dataset/file_base.h#L106)).
 This causes the segfault observed in the linked issue.
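
As a rough illustration of the three-way dispatch described above, here is a minimal pure-Python sketch. It is not the Cython code in `_dataset.pyx`: the names `open_source` and `PLACEHOLDER_PATH` are hypothetical, and plain bytes-like objects stand in for `pa.Buffer`. It only shows why a file-like source needs its own branch instead of falling through to the path case.

```python
import io

# Hypothetical constant mirroring the "<Buffer>" path string mentioned above.
PLACEHOLDER_PATH = "<Buffer>"


def open_source(source):
    """Return a readable binary file object for one of three source kinds.

    Illustrative only: bytes-like objects stand in for pa.Buffer.
    """
    # Case 1: an in-memory buffer.
    if isinstance(source, (bytes, bytearray, memoryview)):
        return io.BytesIO(bytes(source))

    # Case 2: a path string. The placeholder is not a real path, so
    # fail loudly rather than trying to open it.
    if isinstance(source, str):
        if source == PLACEHOLDER_PATH:
            raise ValueError("placeholder path does not name a real file")
        return open(source, "rb")

    # Case 3: a file-like object exposing a `read` attribute.
    if hasattr(source, "read"):
        return source

    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

In this sketch, a file-like object is returned as-is, so callers never reach the path branch with the placeholder string.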
   
   ### What changes are included in this PR?
   
   1. Modify `FileFragment.open()` to handle all three `CFileSource` cases 
listed above.
   2. Add a unit test which segfaults without the change in (1) and passes 
with the change.
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes; this PR fixes a user-facing bug in the `FileFragment` API.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
