[ 
https://issues.apache.org/jira/browse/ARROW-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-7972.
-------------------------------
    Resolution: Duplicate

Closing to follow the issue on ARROW-3410

> [Python] Allow reading CSV in chunks
> ------------------------------------
>
>                 Key: ARROW-7972
>                 URL: https://issues.apache.org/jira/browse/ARROW-7972
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 0.16.0
>            Reporter: Bulat Yaminov
>            Priority: Major
>
> Currently in the Python API you can read a CSV using 
> [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html].
>  There are some settings for the reader that you can pass in 
> [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions],
>  but I don't see an option to read part of the CSV file instead of the 
> whole (or starting from {{skip_rows}}). As a result, if I have a big CSV file 
> that cannot fit into memory, I cannot process it with this API.
> Is it possible to implement a chunked iterator in a similar way to how 
> [Pandas allows 
> it|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]?
> {code:python}
> from pyarrow import csv
>
> for table_chunk in csv.read_csv("big.csv",
>                                 read_options=csv.ReadOptions(chunksize=1_000_000)):
>     # do something with the table_chunk, e.g. filter and save to disk
>     pass
> {code}
> Thanks in advance for your response.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
