jiale0402 opened a new issue, #38878:
URL: https://github.com/apache/arrow/issues/38878
### Describe the usage question you have. Please include as many useful
details as possible.
platform:
```
NAME="Ubuntu"
VERSION="23.04 (Lunar Lobster)"
```
pyarrow version: `pyarrow 14.0.1`, `pyarrow-hotfix 0.5`
python version: `Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux`
I have a very large single-column CSV file (about 63 million rows). I was
hoping to create a lazy file streamer that reads one entry from the CSV file at
a time. I know each entry in my file is 12 characters long, so I tried
setting `block_size` to 13 (+1 for the `\n`) with the `pyarrow.csv.open_csv` function:
```python
import pyarrow as pa
import pyarrow.csv as csv

# file: path to the single-column CSV described above
c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options,
                      read_options=r_options)
```
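The stream is then consumed batch by batch, roughly like the sketch below (`process` is just a placeholder for the actual per-batch work, not part of the real code):

```python
# Sketch of how the stream is consumed; `process` is a hypothetical
# placeholder for the per-batch work.
for batch in stream:              # each item is a pyarrow.RecordBatch
    process(batch.column("dne"))  # with block_size=13, roughly one row per batch
```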
This code works as expected, but when I change the
`skip_rows_after_names` parameter of `ReadOptions` to 8300 I start to get
segmentation faults. How do I fix this (or am I using it wrong)? I want to be
able to use only a portion of the file (like from row 98885 to 111200).
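To make that intent concrete, here is a rough sketch (untested at this scale; the `start`/`stop` bounds are just the example rows above) of selecting such a row window by slicing batches while iterating with a larger `block_size`, instead of relying on `skip_rows_after_names`:

```python
import pyarrow as pa
import pyarrow.csv as csv

start, stop = 98885, 111200        # example row window from above
c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
r_options = csv.ReadOptions(column_names=["dne"], block_size=1 << 20)

stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)

rows_seen = 0
pieces = []
for batch in stream:
    # Keep only the part of this batch that falls inside [start, stop).
    lo = max(start - rows_seen, 0)
    hi = min(stop - rows_seen, batch.num_rows)
    if lo < hi:
        pieces.append(batch.slice(lo, hi - lo))
    rows_seen += batch.num_rows
    if rows_seen >= stop:
        break

window = pa.Table.from_batches(pieces)   # rows start..stop-1
```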
I was able to reproduce this error on another computer with the exact same
platform and versions. The file was created with:
```python
import random

# FILE_LEN is roughly 63 million (the row count mentioned above);
# i is the index of the feature file being written.
with open(f"feature_{i}.csv", "w+") as f:
    for _ in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")
```
### Component(s)
Python