westonpace commented on issue #36295: URL: https://github.com/apache/arrow/issues/36295#issuecomment-1613973778
Indeed, this was the case. In the internals of the "group by" code is a hash map. This hash map stores keys in row-major order using the row-format (row-major order is commonly used with hash-maps since it's expected they will be accessed in random non-sequentially). This structure is called the RowTableImpl: https://github.com/apache/arrow/blob/9d92ed4dc4d6d691668609641043561137e91401/cpp/src/arrow/compute/row/row_internal.h#L162 In the example we are using 6 columns (if we include the index) of int8 as the hash keys. This means each row in our row table is 48 bytes large. The test uses 100,000,000 rows. So we end up with 4,800,000,000 bytes which is over the int32 limit of 4,294,967,296. With 5 columns we only have 4,000,000,000 bytes and so it fits. This _should_ be ok. The `RowTableImpl` appears to have been designed to store 2^64 bytes of data. In practice we do not have any unit tests covering this case and so, unsurprisingly, we have mistakes. The following patch fixes the reproduction from the OP: https://github.com/apache/arrow/commit/c901d1eacd614cdb0f44cd736034be098b16aae0 Unfortunately, I still have pretty low confidence in the robustness of this code when working with tables this large. There are different paths for boolean fields, dictionary fields, string-like fields, and all of these paths are futher parameterized by AVX2 vs. not-AVX2 and we no longer have CI for AVX2. We are missing a set of unit tests to cover these scenarios and I don't think I'll have time to add them for a while. So I'm uncertain at the moment whether it would be better to put in this short-term fix or to simply reject and fail a plan once the keys table exceeds 4GB. If anyone is interested in taking on the effort of adding these unit tests I'd be happy to explain further. In the meantime I'm not sure when I'll get to it. I'm marking this blocker to force myself to make some kind of decision on this in the 13.0.0 timeframe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
