zanmato1984 commented on PR #43389: URL: https://github.com/apache/arrow/pull/43389#issuecomment-2296514853
Thanks for the feedback!

> * there was a lengthy discussion (and a document) about larger than memory datasets [GH-31769: [C++][Acero] Add spilling for hash join #13669](https://github.com/apache/arrow/pull/13669) [[C++] Support hash-join on larger than memory datasets #31769](https://github.com/apache/arrow/issues/31769), will there be any progress in this direction? Now that the limit is gone I expect an influx of reports about crashed code solely because of the dataset size.

Unfortunately there is no update or plan for it, AFAIK. If by "crashed code solely because of the data size" you mean the operating system's OOM kill, then yes :)

> * > An extra 4 bytes of memory consumption for each row due to the offset size difference from 32-bit to 64-bit.
>   > A wider offset type requires a few more SIMD instructions in each 8-row processing iteration.
>
>   Do you think it's possible to add some heuristics and/or an explicit key to keep the old behaviour for reasonably small datasets?

Great question. I've thought about that too and had a short discussion with other developers in the community meeting. My reasons for not doing this are:

1) Code complexity: a lot of code assumes a concrete offset type, especially the SIMD-specialized functions, and generalizing it is not trivial.
2) Runtime overhead: the data size is very likely unknown before the computation starts, so the baseline would be to use 32-bit offsets as the data accumulates until a certain limit (4GB) is reached, and then switch to 64-bit offsets. That switch would require a huge memory copy from the 32-bit offset buffer to the 64-bit offset buffer.

Thus the decision.
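
To make the trade-off in the two reasons above concrete, here is a minimal Python sketch. It is not Arrow code: `extra_offset_bytes` and `AdaptiveOffsets` are hypothetical names invented for illustration, and the one-offset-per-row layout is a simplifying assumption. It shows the fixed cost of 4 extra bytes per row with 64-bit offsets, and the full-buffer copy that an adaptive "start with 32-bit, widen later" scheme would incur at the moment the limit is crossed.

```python
from array import array

UINT32_MAX = 2**32 - 1  # largest offset a 32-bit unsigned offset can hold


def extra_offset_bytes(num_rows):
    """Extra memory from using 8-byte instead of 4-byte offsets (one per row)."""
    return num_rows * (8 - 4)


class AdaptiveOffsets:
    """Hypothetical adaptive scheme: 32-bit offsets until `limit` is exceeded."""

    def __init__(self, limit=UINT32_MAX):
        self.limit = limit
        self.offsets = array("I", [0])  # "I": unsigned int, typically 4 bytes
        self.widened = False
        self.copied_entries = 0  # how many offsets the widening step had to copy

    def append_row(self, row_bytes):
        end = self.offsets[-1] + row_bytes
        if not self.widened and end > self.limit:
            # Every offset accumulated so far is copied into a wider buffer.
            # This is the "huge memory copy" cost described in reason 2.
            self.offsets = array("Q", self.offsets.tolist())  # "Q": 8 bytes
            self.copied_entries = len(self.offsets)
            self.widened = True
        self.offsets.append(end)


if __name__ == "__main__":
    print(extra_offset_bytes(1_000_000), "extra bytes for 1M rows")
    demo = AdaptiveOffsets(limit=100)  # tiny limit so the switch is visible
    for _ in range(3):
        demo.append_row(40)
    print("widened:", demo.widened, "entries copied:", demo.copied_entries)
```

With the real 4GB limit, the number of entries copied at the switch grows with the number of rows accumulated so far, which is why the always-64-bit design avoids this path entirely at the price of the fixed per-row overhead.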
