zeroshade commented on issue #354: URL: https://github.com/apache/arrow-go/issues/354#issuecomment-2802821837
@stevbear The behavior is:

If `BufferedStreamEnabled` == `true`: When a `ColumnReader` is initialized, it uses `io.NewSectionReader` and the `BufferSize` set in the reader properties to buffer data as it reads from the column. For high-latency systems (like cloud object stores) this could be suboptimal depending on the buffer size, because each buffer refill is a separate read against the source, but it keeps memory usage bounded by the buffer size.

If `BufferedStreamEnabled` == `false`: When a `ColumnReader` is initialized, it reads the entire *column* into memory, not the entire file. At the expense of more memory usage, this could be more performant with cloud object storage, since it performs only a single round trip to retrieve the column data. But if you are going to use page indexes or offset indexes to read only a subset of the data in the column, you'll end up reading far more data (i.e. higher memory usage) than you actually need.

You can see where the logic is managed here: https://github.com/apache/arrow-go/blob/main/parquet/reader_properties.go#L75

It is called here when creating a `PageReader` for a column: https://github.com/apache/arrow-go/blob/main/parquet/file/row_group_reader.go#L117