zeroshade commented on issue #354:
URL: https://github.com/apache/arrow-go/issues/354#issuecomment-2802821837

   @stevbear The behavior is:
   
   If `BufferedStreamEnabled` == `true`: When a `ColumnReader` is initialized, 
it will use `io.NewSectionReader` and the `BufferSize` set in the reader 
properties to buffer data as it reads from the column. For high-latency systems 
(like cloud object stores) this can be suboptimal depending on the buffer size 
used, since a small buffer means more separate reads, but it does help control 
memory usage, because the amount buffered is bounded by `BufferSize`.
   
   If `BufferedStreamEnabled` == `false`: When a `ColumnReader` is initialized, 
it will read the entire *column* into memory, not the entire file. At the 
expense of more memory usage, this could be more performant when dealing with 
cloud object storage, as it will perform only a single round trip to retrieve 
the column data. But if you are going to use page indexes or offset indexes 
to read only a subset of the data in the column, you'll end up reading (and 
holding in memory) far more data than you actually need.
   
   You can see where the logic is managed here: 
https://github.com/apache/arrow-go/blob/main/parquet/reader_properties.go#L75
   
   And it is called here: 
https://github.com/apache/arrow-go/blob/main/parquet/file/row_group_reader.go#L117
 when creating a `PageReader` for a column.


