pspoerri opened a new pull request, #41694:
URL: https://github.com/apache/arrow/pull/41694
### Rationale for this change
Allow reading from non-seekable FIFO paths (e.g. stdin).
For example, the following snippet currently fails:
```
from pyarrow import csv, input_stream
stdin = input_stream('/dev/stdin')
data = csv.read_csv(stdin)
print(data)
```
Running this code will always trigger an OSError:
```
# cat test.csv | python test2.py
Traceback (most recent call last):
File "/mnt/nvme0n1/psp/arrow-test/test.py", line 4, in <module>
stdin = input_stream('/dev/stdin')
File "pyarrow/io.pxi", line 2690, in pyarrow.lib.input_stream
File "pyarrow/io.pxi", line 1164, in pyarrow.lib.OSFile.__cinit__
File "pyarrow/io.pxi", line 1176, in pyarrow.lib.OSFile._open_readable
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: lseek failed
```
To work around this limitation, one has to read stdin through a Python
file object:
```
import sys
import os
from pyarrow import csv
stdin = os.fdopen(sys.stdin.fileno(), "rb")
data = csv.read_csv(stdin)
print(data)
```
Example csv:
```
customer_id,customer
1,customer1
2,customer2
```
### What changes are included in this PR?
Set the size of the file to `-1` if the stream is not seekable. The same
convention is already used here for non-seekable file descriptors:
https://github.com/pspoerri/arrow/blob/main/cpp/src/arrow/io/file.cc#L94-L95
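To illustrate the idea, here is a minimal Python sketch of the "size is `-1` when not seekable" convention. It is not the Arrow C++ code: `probe_size` is a hypothetical helper, and the sketch assumes a POSIX system where `lseek` on a pipe fails with an `OSError`.

```python
import os
import tempfile


def probe_size(fd: int) -> int:
    """Return the size in bytes behind fd, or -1 if fd is not seekable.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    try:
        # Pipes, FIFOs and sockets reject lseek (ESPIPE on POSIX).
        current = os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return -1
    # Seekable: measure the size, then restore the original position.
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, current, os.SEEK_SET)
    return size


# A pipe behaves like stdin fed by `cat test.csv | ...`: not seekable.
read_fd, write_fd = os.pipe()
pipe_size = probe_size(read_fd)
os.close(read_fd)
os.close(write_fd)

# A regular file is seekable, so its real size is reported.
with tempfile.TemporaryFile() as f:
    f.write(b"customer_id,customer\n")
    f.flush()
    file_size = probe_size(f.fileno())

print(pipe_size, file_size)
```

A caller that sees `-1` can treat the size as unknown and read the stream sequentially until EOF instead of failing up front.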
### Are these changes tested?
I tested the code samples above.
### Are there any user-facing changes?
Not that I am aware of.