zeroshade commented on issue #278: URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657404621
Hi @stevbear, optimizing reads from S3 has been on my list for a while; I just hadn't gotten around to it. It didn't get prioritized because no one had filed any issues concerning it, so thank you for filing this!

> When I set BufferedStreamEnabled to true, the library seems to be reading the row group page by page, which is not optimal for cloud usage.

This is likely because the default buffer size is 16KB. Have you tried increasing the `BufferSize` member of the `ReaderProperties` so that it buffers more pages at a time? Perhaps try a buffer of a few MB? (See the sketch at the end of this comment.)

I have a few ideas for further optimization via pre-buffering (each with different trade-offs), so can you give me a bit more context to make sure your use case would be helped and to identify which idea would work best for you? Specifically:

* If reading an entire column for a single row group gives you an OOM, you either have a significantly large row group, or I'm guessing it's string data with a lot of large strings? That leads to the question of what you're doing with the column data after you read it from the row group.
* If you can't hold the entire column from a single row group in memory, are you streaming the data somewhere? (If so, the batch-reading sketch below may be relevant.)
* Are you reading only a single column at a time, or multiple columns from the row group?
* Can you give me more of an idea of the sizes of the columns / row groups in the file, and the memory limitations of your system?
* Is the issue the copy that happens when decoding/decompressing the column data?

The more information the better, so we can figure out a good solution here; it also gives me the opportunity to improve the memory usage of the parquet package like I've been wanting to! :smile:
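For example, here's a minimal, untested sketch of what I mean by increasing `BufferSize`. The import paths assume arrow-go v18, and `rdr` is a placeholder for however you wrap your S3 object as an `io.ReaderAt` + `io.Seeker`:

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

// openWithLargeBuffer opens a parquet file with buffered streaming enabled
// and a buffer big enough that each read pulls in several pages at once,
// cutting down the number of round trips to S3.
func openWithLargeBuffer(rdr parquet.ReaderAtSeeker) (*file.Reader, error) {
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	props.BufferedStreamEnabled = true
	props.BufferSize = 8 * 1024 * 1024 // default is 16KB; try a few MB instead

	return file.NewParquetReader(rdr, file.WithReadProps(props))
}
```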

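And on the streaming question: if you can't hold a whole column chunk in memory, you can pull it in fixed-size batches via the column chunk readers. A hedged sketch, where the INT64 column type and the `process` sink are assumptions for illustration (string data would go through `*file.ByteArrayColumnChunkReader` and `[]parquet.ByteArray` instead):

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/parquet/file"
)

// process is a hypothetical stand-in for wherever you stream the values.
func process(vals []int64) { /* ... */ }

// streamColumn reads one column chunk batch-by-batch so that only
// batchSize values are resident at a time, never the whole chunk.
func streamColumn(pr *file.Reader, rowGroup, colIdx int) error {
	col, err := pr.RowGroup(rowGroup).Column(colIdx)
	if err != nil {
		return err
	}
	rdr, ok := col.(*file.Int64ColumnChunkReader)
	if !ok {
		return fmt.Errorf("column %d is not INT64", colIdx)
	}

	const batchSize = 4096
	values := make([]int64, batchSize)
	defLvls := make([]int16, batchSize)
	for rdr.HasNext() {
		// ReadBatch returns the total levels read and the count of non-null values.
		_, valuesRead, err := rdr.ReadBatch(batchSize, values, defLvls, nil)
		if err != nil {
			return err
		}
		process(values[:valuesRead])
	}
	return nil
}
```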