[ https://issues.apache.org/jira/browse/ARROW-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223614#comment-17223614 ]

Joris Van den Bossche edited comment on ARROW-10419 at 10/30/20, 12:23 PM:
---------------------------------------------------------------------------

Agreed that such a keyword would be useful (just being able to peek at the 
first rows of a big file would already make it worthwhile).

I think the general problem is that the reader processes different blocks of 
data in parallel (there is a {{block_size}} option in {{ReadOptions}}), so this 
might not work with the multithreaded reading that is enabled by default: you 
need to know the number of rows in the first block to know whether the next 
block needs to be processed at all. 

For the "chunked" reader case you mention, there is already 
{{pyarrow.csv.open_csv}}, which returns a streaming reader that reads batch by 
batch:

{code}
In [1]: pd.DataFrame({'a': np.arange(1_000_000)}).to_csv("test.csv", index=False)

In [2]: from pyarrow import csv

In [3]: reader = csv.open_csv("test.csv")

In [4]: reader
Out[4]: <pyarrow._csv.CSVStreamingReader at 0x7fe629398278>

In [5]: reader.read_next_batch().to_pandas()
Out[5]: 
             a
0            0
1            1
2            2
3            3
4            4
...        ...
165664  165664
165665  165665
165666  165666
165667  165667
165668  165668

[165669 rows x 1 columns]

In [6]: reader.read_next_batch().to_pandas()
Out[6]: 
             a
0       165669
1       165670
2       165671
3       165672
4       165673
...        ...
149791  315460
149792  315461
149793  315462
149794  315463
149795  315464

[149796 rows x 1 columns]
{code}

The number of rows per batch depends on the {{block_size}}, so you _can_ 
control this, but not as easily as simply specifying a number of rows with a 
{{max_rows}} keyword. 

{code}
In [13]: reader = csv.open_csv("test.csv", read_options=csv.ReadOptions(block_size=20))

In [14]: reader.read_next_batch().to_pandas()
Out[14]: 
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8

In [15]: reader.read_next_batch().to_pandas()
Out[15]: 
    a
0   9
1  10
2  11
3  12
4  13
5  14
6  15
{code}

So using this streaming reader with {{open_csv}}, you can, I think, already 
somewhat achieve what you want. (It could also be used to read only the first 
_n_ rows, instead of using {{read_csv}}.)

The question remains whether we can make this easier directly in pyarrow, by 
allowing a {{max_rows}} to be specified instead of a {{block_size}}. For the 
general multithreaded reader, I don't think this is easy (as mentioned above), 
but for the streaming reader (which is single-threaded anyway) it should be 
possible, I suppose. 

cc [~apitrou]



> [C++] Add max_rows parameter to csv ReadOptions
> -----------------------------------------------
>
>                 Key: ARROW-10419
>                 URL: https://issues.apache.org/jira/browse/ARROW-10419
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Marc Garcia
>            Priority: Major
>              Labels: csv
>
> I'm trying to read only the first 1,000 rows of a huge CSV with PyArrow.
> I don't see a way to do this with Arrow. I guess it should be easy to 
> implement by adding a `max_rows` parameter to `pyarrow.csv.ReadOptions`.
> After reading the first 1,000 rows, it should be possible to load the next 
> 1,000 (or any other chunk) by using the new `max_rows` together with 
> `skip_rows` (e.g. `pyarrow.csv.read_csv(path, 
> read_options=pyarrow.csv.ReadOptions(skip_rows=1_000, max_rows=1_000))` 
> would read rows 1,000 to 2,000).
> Thanks!


