etseidl commented on PR #9868: URL: https://github.com/apache/arrow-rs/pull/9868#issuecomment-4373820729
I've tracked the regression down to increased overhead in `try_reserve_exact`. It shows up particularly in the page index bench because parsing the page index structures performs a large number of vector reads. I think for now we should limit the scope of this PR to fixing the wraparound in `read_list_begin`, and leave the question of how to handle potential OOM errors to a larger discussion. While an abort is not ideal, it appears to be part of Rust's broader design philosophy: given that the only way to deal with OOM is to exit anyway, why place an extra burden on every allocation?

The suggested alternate fix for this issue, namely erroring when the size of an allocation exceeds the size of the input buffer, also has serious problems. The foremost is that the input stream is highly compressed, so the resulting vectors can be much larger than the input. For instance, `size_of::<SchemaElement>()` returns 96 IIRC, but the schema for the wide benchmark (10,000 columns) is on the order of 138 kB. `96 * 10_000` (960 kB) is much larger than the input, and could trigger the proposed OOM detection for a perfectly valid file.

To summarize:
- Let's focus only on detecting wraparound in `read_list_begin`
- Let's move discussion of OOM issues with `with_capacity` to #9874
