[ 
https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Dimitrov updated ARROW-5318:
---------------------------------
    Description: 
I am using pyarrow's HadoopFileSystem interface. When I read n bytes, I often 
see 0%-300% more data than requested sent over the network. I suspect that 
pyarrow is reading ahead.

The pyarrow parquet reader doesn't have this behavior, and I am looking for a 
way to turn off read ahead for the general HDFS interface.

I am running Ubuntu 14.04 with Python 2.7. The issue is present in pyarrow 
0.10 through 0.13 (the newest released version).

I have been using Wireshark to track the packets sent over the network.
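
A minimal sketch of how the overhead percentage can be derived from byte counts taken from a packet capture. The function name `overrequest_percent` and the sample numbers are hypothetical; they are not measurements from this issue.

```python
def overrequest_percent(requested_bytes, transferred_bytes):
    """Return how much more data crossed the network than was requested, in percent."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    extra = transferred_bytes - requested_bytes
    return 100.0 * extra / requested_bytes


# Hypothetical numbers, not measurements from this report.
print(overrequest_percent(3000000, 3000000))   # 0.0
print(overrequest_percent(3000000, 12000000))  # 300.0
```

Read this way, "300% more" means that four times the requested number of bytes was transferred.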

I suspect read-ahead because the first read takes much longer than the second 
read.
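
A minimal sketch of the timing comparison, assuming any open file-like object with a `read` method. The helper name `time_two_reads` is hypothetical, and an in-memory `io.BytesIO` stands in for the HDFS handle so the sketch runs without a cluster.

```python
import io
from timeit import default_timer


def time_two_reads(f, n_bytes):
    """Time two back-to-back reads of n_bytes from an open file-like object."""
    timings = []
    for _ in range(2):
        start = default_timer()
        data = f.read(n_bytes)
        elapsed = default_timer() - start
        # Record how many bytes came back and how long the call took.
        timings.append((len(data), elapsed))
    return timings


# Stand-in for the HDFS handle; with pyarrow, pass the object
# returned by fs.open(file_path) instead.
demo = io.BytesIO(b"x" * 100)
results = time_two_reads(demo, 40)
for i, (n, secs) in enumerate(results, 1):
    print("read %d: %d bytes in %.6f s" % (i, n, secs))
```

A much slower first read is consistent with read-ahead but does not prove it, since connection setup and caching can also make the first call slower.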

 

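The regular pyarrow reader:
{code:python}
import pyarrow as pa

# hostname is the HDFS namenode host (value not given here)
fs = pa.hdfs.connect(hostname, driver='libhdfs')
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
# Requests 3,000,000 bytes; more than that is seen on the network
f.read(n_bytes)
{code}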
 

Parquet code without the same issue:
{code:python}
import pyarrow.parquet as pq

parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pq.ParquetFile(pf)
# Reads a single column from the first row group
data = pqf.read_row_group(0, columns=['col_name'])
{code}
 

 


> pyarrow hdfs reader overrequests  
> ----------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
