westonpace commented on issue #36295:
URL: https://github.com/apache/arrow/issues/36295#issuecomment-1613973778

   Indeed, this was the case.  In the internals of the "group by" code is a 
hash map.  This hash map stores keys in row-major order using the row-format 
(row-major order is commonly used with hash-maps since it's expected they will 
be accessed in random non-sequentially).  This structure is called the 
RowTableImpl: 
https://github.com/apache/arrow/blob/9d92ed4dc4d6d691668609641043561137e91401/cpp/src/arrow/compute/row/row_internal.h#L162
   
   In the example we are using 6 columns (if we include the index) of int8 as 
the hash keys.  This means each row in our row table is 48 bytes large.  The 
test uses 100,000,000 rows. So we end up with 4,800,000,000 bytes which is over 
the int32 limit of 4,294,967,296.  With 5 columns we only have 4,000,000,000 
bytes and so it fits.
   
   This _should_ be ok.  The `RowTableImpl` appears to have been designed to 
store 2^64 bytes of data.  In practice we do not have any unit tests covering 
this case and so, unsurprisingly, we have mistakes.  The following patch fixes 
the reproduction from the OP: 
https://github.com/apache/arrow/commit/c901d1eacd614cdb0f44cd736034be098b16aae0
   
   Unfortunately, I still have pretty low confidence in the robustness of this 
code when working with tables this large.  There are different paths for 
boolean fields, dictionary fields, string-like fields, and all of these paths 
are futher parameterized by AVX2 vs. not-AVX2 and we no longer have CI for AVX2.
   
   We are missing a set of unit tests to cover these scenarios and I don't 
think I'll have time to add them for a while.  So I'm uncertain at the moment 
whether it would be better to put in this short-term fix or to simply reject 
and fail a plan once the keys table exceeds 4GB.  If anyone is interested in 
taking on the effort of adding these unit tests I'd be happy to explain 
further.  In the meantime I'm not sure when I'll get to it.  I'm marking this 
blocker to force myself to make some kind of decision on this in the 13.0.0 
timeframe.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to