zeroshade commented on issue #278: URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657404621
Hi @stevbear, optimizing reads from S3 has been on my list for a while; I just hadn't gotten around to it. It didn't get prioritized because no one had filed any issues concerning it, so thank you for filing this!

> When I set BufferedStreamEnabled to true, the library seems to be reading the row group page by page, which is not optimal for cloud usage.

This is likely because the default buffer size is 16KB. Have you tried increasing the `BufferSize` member of the `ReaderProperties` so that it buffers more pages at a time? Perhaps try a buffer of a few MB? (See the sketch at the end of this comment.)

I have a few ideas for further optimization via pre-buffering (each with different trade-offs), so can you give me a bit more context to make sure your use case would be helped and to identify which idea would work best for you? Specifically:

* If reading an entire column for a single row group gives you an OOM, you either have a significantly large row group, or I'm guessing it's string data with a lot of large strings? That leads to the question of what you're doing with the column data after you read it from the row group.
* If you can't hold the entire column from a single row group in memory, are you streaming the data somewhere? (If so, the batch-reading sketch below may be relevant.)
* Are you reading only a single column at a time, or multiple columns from the row group?
* Can you give me more of an idea of the sizes of the columns / row groups in the file, and the memory limitations of your system?
* Is the issue the copy that happens when decoding/decompressing the column data?

The more information the better, so we can figure out a good solution here; it also gives me the opportunity to improve the memory usage of the parquet package like I've been wanting to! :smile:
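For example, here's a minimal, untested sketch of what I mean by increasing `BufferSize`. The import paths assume arrow-go v18, and `rdr` is a placeholder for however you wrap your S3 object as an `io.ReaderAt` + `io.Seeker`:

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

// openWithLargeBuffer opens a parquet file with buffered streaming enabled
// and a buffer big enough that each read pulls in several pages at once,
// cutting down the number of round trips to S3.
func openWithLargeBuffer(rdr parquet.ReaderAtSeeker) (*file.Reader, error) {
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	props.BufferedStreamEnabled = true
	props.BufferSize = 8 * 1024 * 1024 // default is 16KB; try a few MB instead

	return file.NewParquetReader(rdr, file.WithReadProps(props))
}
```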

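And on the streaming question: if you can't hold a whole column chunk in memory, you can pull it in fixed-size batches via the column chunk readers. A hedged sketch, where the INT64 column type and the `process` sink are assumptions for illustration (string data would go through `*file.ByteArrayColumnChunkReader` and `[]parquet.ByteArray` instead):

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/parquet/file"
)

// process is a hypothetical stand-in for wherever you stream the values.
func process(vals []int64) { /* ... */ }

// streamColumn reads one column chunk batch-by-batch so that only
// batchSize values are resident at a time, never the whole chunk.
func streamColumn(pr *file.Reader, rowGroup, colIdx int) error {
	col, err := pr.RowGroup(rowGroup).Column(colIdx)
	if err != nil {
		return err
	}
	rdr, ok := col.(*file.Int64ColumnChunkReader)
	if !ok {
		return fmt.Errorf("column %d is not INT64", colIdx)
	}

	const batchSize = 4096
	values := make([]int64, batchSize)
	defLvls := make([]int16, batchSize)
	for rdr.HasNext() {
		// ReadBatch returns the total levels read and the count of non-null values.
		_, valuesRead, err := rdr.ReadBatch(batchSize, values, defLvls, nil)
		if err != nil {
			return err
		}
		process(values[:valuesRead])
	}
	return nil
}
```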