[
https://issues.apache.org/jira/browse/ARROW-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney closed ARROW-7972.
-------------------------------
Resolution: Duplicate
Closing to follow the issue on ARROW-3410
> [Python] Allow reading CSV in chunks
> ------------------------------------
>
> Key: ARROW-7972
> URL: https://issues.apache.org/jira/browse/ARROW-7972
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 0.16.0
> Reporter: Bulat Yaminov
> Priority: Major
>
> Currently in the Python API you can read a CSV using
> [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html].
> There are some settings for the reader that you can pass in
> [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions],
> but I don't see an option to read only part of the CSV file instead of the
> whole file (or to start from {{skip_rows}}). As a result, if I have a big
> CSV file that does not fit into memory, I cannot process it with this API.
> Would it be possible to implement a chunked iterator, in a similar way to
> what [Pandas
> allows|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]:
> {code:python}
> from pyarrow import csv
> for table_chunk in csv.read_csv(
>         "big.csv",
>         read_options=csv.ReadOptions(chunksize=1_000_000)):
>     # do something with the table_chunk, e.g. filter and save to disk
>     pass
> {code}
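> In the meantime, a workaround can be sketched with Pandas' existing
> {{chunksize}} option, converting each chunk to an Arrow Table via
> {{pyarrow.Table.from_pandas}} (this assumes pandas and pyarrow are
> installed; the file name and chunk size below are illustrative, and the
> sample CSV is created only to make the sketch self-contained):
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> # Create a small sample CSV so the sketch runs as-is.
> pd.DataFrame({"x": range(10)}).to_csv("big.csv", index=False)
>
> tables = []
> for df_chunk in pd.read_csv("big.csv", chunksize=4):
>     # Each pandas chunk becomes an independent Arrow Table,
>     # so only one chunk needs to be in memory at a time.
>     tables.append(pa.Table.from_pandas(df_chunk))
>
> total_rows = sum(t.num_rows for t in tables)
> {code}
> Each Table chunk can then be filtered and written to disk (e.g. as Parquet)
> before the next chunk is read, keeping peak memory bounded by the chunk size.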
> Thanks in advance for your feedback.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)