[
https://issues.apache.org/jira/browse/ARROW-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sascha Hofmann updated ARROW-7661:
----------------------------------
Description:
We are reading a very simple CSV file (see below).
The file is only 245 bytes, well below the default _block_size_ in
_ReadOptions_, so we would expect the resulting table to have only one batch.
At least, that is if we understand correctly that a _block_ refers to a chunk
of input of a certain byte size?
The docs state: _This will determine multi-threading granularity as well as the
size of individual chunks in the Table._ To us, that implies it also determines
the size of individual batches?
Previously, we thought that by setting block_size to the total file size we
would ensure that even for files larger than 1 MB we get a pa.Table with only
one batch. This mini file seems to prove us wrong?
Additionally, if we convert the table to pandas and back, we get only one batch.
To reproduce:
{code:python}
import os
from pyarrow import csv as pc
import pyarrow as pa
path = "test.csv"
read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
df = pc.read_csv(path, read_options=read_options)
print(len(df.to_batches()))
# returns 2
print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
# returns the last line of the file
pdf = df.to_pandas()
ndf = pa.Table.from_pandas(pdf)
print(len(ndf.to_batches()))
# returns 1{code}
test.csv:
{code:none}
"Name","Month","Change in %"
"Surrey Quays","Sep 18","1.01"
"Surrey Quays","Oct 18","0.38"
"Surrey Quays","Nov 18","0.97"
"Surrey Quays","Dec 18","1.28"
"Surrey Quays","Jan 19","2.43"
"Surrey Quays","Feb 19","2.49"
"Surrey Quays","Mar 19","0.81"
{code}
was:
We are reading a very simple csv (see below) and want to fix the number of
created batches to only 1. The file is only 245 bytes so way below the default
block_size in the ReadOptions. About that I have the first question: I
understand that a block refers to the number of lines of certain byte size?
The docs state: _This will determine multi-threading granularity as well as the
size of individual chunks in the Table._ For me, that means also the size of
individual batches?
Previously, we thought by fixing the block_size to the total file size we would
ensure that even for files larger than 1MB we get a pa.Table with only one
batch. This mini file seems to prove us wrong?
Additionally, if I convert back and forth to pandas we get only one batch.
To reproduce:
{code:python}
import os
from pyarrow import csv as pc
import pyarrow as pa
path = "test.csv"
read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
df = pc.read_csv(path, read_options=read_options)
print(len(df.to_batches()))
# returns 2
print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
# returns the last line of the file
pdf = df.to_pandas()
ndf = pa.Table.from_pandas(pdf)
print(len(ndf.to_batches()))
# returns 1{code}
test.csv:
{code:none}
"Name","Month","Change in %"
"Surrey Quays","Sep 18","1.01"
"Surrey Quays","Oct 18","0.38"
"Surrey Quays","Nov 18","0.97"
"Surrey Quays","Dec 18","1.28"
"Surrey Quays","Jan 19","2.43"
"Surrey Quays","Feb 19","2.49"
"Surrey Quays","Mar 19","0.81"
{code}
> [Python] Specify number of batches for read_csv
> -----------------------------------------------
>
> Key: ARROW-7661
> URL: https://issues.apache.org/jira/browse/ARROW-7661
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1, 0.15.0, 0.15.1
> Reporter: Sascha Hofmann
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)