[jira] [Commented] (ARROW-7661) [Python] Specify number of batches for read_csv

Joris Van den Bossche (Jira) Tue, 28 Jan 2020 07:20:28 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025203#comment-17025203
 ]


Joris Van den Bossche commented on ARROW-7661:
----------------------------------------------

OK, that shouldn't happen. 

Can you give a few more details about your system? Which OS? Which version of 
pyarrow?

Also, can you check the two batches? E.g., is the second one an empty batch? (Your 
example above has 7 rows; how are they distributed across the batches?)

> [Python] Specify number of batches for read_csv
> -----------------------------------------------
>
>                 Key: ARROW-7661
>                 URL: https://issues.apache.org/jira/browse/ARROW-7661
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1, 0.15.0, 0.15.1
>            Reporter: Sascha Hofmann
>            Priority: Major
>
> We are reading a very simple CSV file (see below).
> The file is only 245 bytes, so way below the default _block_size_ in the 
> _ReadOptions_. Thus we would expect the resulting table to have only one 
> batch. At least, if I understand correctly that a _block_ refers to a 
> number of lines of a certain byte size? 
> The docs state: _This will determine multi-threading granularity as well as 
> the size of individual chunks in the Table._ To me, that also means the size 
> of individual batches? 
> Previously, we thought that by setting block_size to the total file size, we 
> would ensure that even for files larger than 1MB we get a pa.Table with only 
> one batch. This mini file seems to prove us wrong?
> Additionally, if I convert to pandas and back, we get only one batch.
>  
> To reproduce:
> {code:python}
> import os
> from pyarrow import csv as pc
> import pyarrow as pa
> path = "test.csv"
> read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
> df = pc.read_csv(path, read_options=read_options)
> print(len(df.to_batches()))
> # returns 2
> print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
> # returns the last line of the file
> pdf = df.to_pandas()
> ndf = pa.Table.from_pandas(pdf)
> print(len(ndf.to_batches()))
> # returns 1{code}
> test.csv:
> {code:java}
> "Name","Month","Change in %"
> "Surrey Quays","Sep 18","1.01"
> "Surrey Quays","Oct 18","0.38"
> "Surrey Quays","Nov 18","0.97"
> "Surrey Quays","Dec 18","1.28"
> "Surrey Quays","Jan 19","2.43"
> "Surrey Quays","Feb 19","2.49"
> "Surrey Quays","Mar 19","0.81"
> {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
