Bulat Yaminov created ARROW-7972:
------------------------------------
Summary: Allow reading CSV in chunks
Key: ARROW-7972
URL: https://issues.apache.org/jira/browse/ARROW-7972
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Affects Versions: 0.16.0
Reporter: Bulat Yaminov
Currently in the Python API you can read a CSV using
[{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html].
There are some settings for the reader that you can pass via
[{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions],
but I don't see an option to read only part of the CSV file instead of the
whole thing (or to start from a given row with {{skip_rows}}). As a result,
if I have a big CSV file that does not fit into memory, I cannot process it
with this API.
Would it be possible to implement a chunked iterator, similar to what [Pandas
offers|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]?
For example:
{code:python}
from pyarrow import csv

for table_chunk in csv.read_csv("big.csv",
                                read_options=csv.ReadOptions(chunksize=1_000_000)):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}
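In the meantime, a possible workaround is to lean on Pandas' existing chunked reader and convert each chunk to an Arrow table. This is only a sketch of the idea: the in-memory CSV and the chunk size of 2 are illustrative stand-ins for a large on-disk file and a realistic chunk size.

{code:python}
import io

import pandas as pd
import pyarrow as pa

# Stand-in for a large CSV file on disk; a real path would work the same way.
csv_data = io.StringIO("a,b\n1,x\n2,y\n3,z\n")

chunks = []
# pandas.read_csv(chunksize=...) yields DataFrames of at most `chunksize` rows.
for df_chunk in pd.read_csv(csv_data, chunksize=2):
    # Convert each DataFrame chunk to an Arrow table.
    table_chunk = pa.Table.from_pandas(df_chunk, preserve_index=False)
    # Filter / write the chunk here instead of accumulating, to keep memory flat.
    chunks.append(table_chunk)

# Only for demonstration: reassemble the chunks into one table.
result = pa.concat_tables(chunks)
{code}

This keeps only one chunk's worth of data in Pandas memory at a time, though it does pay the cost of the Pandas-to-Arrow conversion on every chunk, which a native Arrow chunked reader would avoid.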
Thanks in advance for your consideration.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)