[I] Performance reading S3 based files won't match localfilesystem even with large prebuffering. [arrow]

via GitHub Thu, 01 Feb 2024 14:43:26 -0800


mderoy opened a new issue, #39899:
URL: https://github.com/apache/arrow/issues/39899


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm writing a simple program that uses the low-level Parquet reader API, 
`parquet::ParquetFileReader`. The program calls `PreBuffer` with the row groups 
and columns I want to read (along with `CacheOptions::Defaults()`). I then get a 
`parquet::ColumnReader` for each column, loop through them, and copy the values 
to a buffer so that the data ends up in a row-based format (as opposed to 
Parquet's columnar format).
   In this example I'm only parsing booleans, using
               `bool_reader->ReadBatch(1, nullptr, nullptr, (bool*)buf, &values_read);`
   to write directly to `buf`.
   My total test data is 284K and consists of Parquet files that contain 3 
boolean columns, so it is very simple and should be fast.
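
   For illustration, the column-to-row copy described above can be sketched 
without Arrow at all. Everything in this block is hypothetical: `FakeBoolColumn` 
is an in-memory stand-in that only mimics the "ask for up to N values, learn how 
many came back" shape of a column reader (it is not `parquet::BoolReader`), and 
`ToRowMajor` and its `batch_size` parameter are names made up for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a typed column reader. It serves values from an
// in-memory column; no Parquet decoding or I/O is involved.
class FakeBoolColumn {
 public:
  explicit FakeBoolColumn(std::vector<bool> values) : values_(std::move(values)) {}

  // Copies up to batch_size values into out and returns how many were copied.
  std::int64_t ReadBatch(std::int64_t batch_size, bool* out) {
    const std::int64_t remaining =
        static_cast<std::int64_t>(values_.size()) - pos_;
    const std::int64_t n = std::min(batch_size, remaining);
    for (std::int64_t i = 0; i < n; ++i) {
      out[i] = values_[static_cast<std::size_t>(pos_ + i)];
    }
    pos_ += n;
    return n;
  }

 private:
  std::vector<bool> values_;
  std::int64_t pos_ = 0;
};

// Interleaves the columns into a row-major buffer: the value for row r and
// column c lands at rows[r * num_columns + c]. Each column is drained in
// chunks of at most batch_size values.
std::vector<std::uint8_t> ToRowMajor(std::vector<FakeBoolColumn>& columns,
                                     std::int64_t num_rows,
                                     std::int64_t batch_size) {
  assert(num_rows >= 0 && batch_size > 0);
  const std::size_t num_columns = columns.size();
  std::vector<std::uint8_t> rows(
      static_cast<std::size_t>(num_rows) * num_columns, 0);
  auto scratch = std::make_unique<bool[]>(static_cast<std::size_t>(batch_size));

  for (std::size_t c = 0; c < num_columns; ++c) {
    std::int64_t row = 0;
    while (row < num_rows) {
      const std::int64_t want = std::min(batch_size, num_rows - row);
      const std::int64_t got = columns[c].ReadBatch(want, scratch.get());
      if (got == 0) break;  // column ran out before num_rows was reached
      for (std::int64_t i = 0; i < got; ++i) {
        rows[static_cast<std::size_t>(row + i) * num_columns + c] =
            scratch[static_cast<std::size_t>(i)] ? 1 : 0;
      }
      row += got;
    }
  }
  return rows;
}
```

   With `batch_size` set to 1 this mirrors the one-value-per-call pattern of the 
`ReadBatch(1, ...)` call above; a larger `batch_size` produces the same row-major 
buffer with fewer calls per column.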
   
   When I benchmark my code against files on my local filesystem, it takes 
about 1.7s to parse the data this way. But when I give it an S3 file handle 
(created from Arrow's S3 class), it takes significantly longer **EVEN WITH 
PREBUFFER SETTINGS**.
   
   I'm not including the prebuffer in my timings, but here is what I'm seeing:
   - local filesystem: 1.7s
   - S3 without prebuffer: 8.3s
   - S3 with prebuffer: 3.9s (not benchmarking the prebuffer time)
   
   I don't understand why parsing from the local filesystem would be faster 
than parsing from S3 if I've prebuffered the data into memory (remember, I'm not 
including the prebuffer in these timings). I've tried changing the 
`parquet::ReaderProperties` buffer size to 20MB (which should fit the whole file 
in memory), but I can't seem to get performance equivalent to the local 
filesystem.
   
   I'm looking for some guidance. I'd really like to get as close to the local 
filesystem performance as possible. I want to avoid downloading the whole file, 
but I want to be able to read these prebuffered sections of the file 
efficiently.
   
   
   
   ### Component(s)
   
   C++, Parquet


