kccqzy commented on issue #49310:
URL: https://github.com/apache/arrow/issues/49310#issuecomment-3942441002

   Hey! I took another look at this issue, which I reported, and I want to 
point out two observations:
   
   * The first is that the original issue title, which says the result doesn’t 
fit in the string type, is inaccurate. Thinking about the `if_else` carefully, 
the result array should be identical to the `f` array. The `f` array was 
constructed successfully, so the result fits as well. Each string is 16 bytes 
and there are 100 million of them, so the data takes 1.6 GB of memory, well 
within the limit.
   * The second observation is that the largest array size that works 
completely is 82595524. Replacing the three occurrences of `10**8` with this 
number avoids the segfault. This number is exactly `2147483647 // 26`, which 
suggests that the code allocates space for the sum of the left side and the 
right side. To be clear, the operation appears to succeed without a segfault 
only when the *concatenation* of the left and right sides fits within the 
string type.
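
   The arithmetic behind this threshold can be checked with a short sketch. 
It assumes the right side contributes 10 bytes per row (26 minus the 16 bytes 
of the `f` strings); that figure is inferred from the observed threshold, not 
read from the code. The names below are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit int holds

LEFT_BYTES = 16   # per-row size of the `f` strings, as stated above
RIGHT_BYTES = 10  # assumption: 26 - 16, inferred from the observed threshold


def combined_bytes(n_rows):
    """Bytes needed if space for both sides is reserved for every row."""
    return n_rows * (LEFT_BYTES + RIGHT_BYTES)


max_rows = INT32_MAX // (LEFT_BYTES + RIGHT_BYTES)

print(max_rows)                                   # largest row count that fits
print(combined_bytes(max_rows) <= INT32_MAX)      # still fits
print(combined_bytes(max_rows + 1) <= INT32_MAX)  # one more row does not
print(10**8 * LEFT_BYTES <= INT32_MAX)            # the real result alone fits
```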
   
   I took a quick look at the code; there is a comment “allocate data buffer 
conservatively”, after which it computes an allocation size from this sum. 
GDB showed that the result of the sum is the negative value `-1694967296`, 
which is exactly 2600000000 reinterpreted as a signed 32-bit integer. Since 
the code stores the result in an int64_t, I wonder whether the intent was to 
cast the left and right sizes to int64_t before doing the addition?
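
   To double-check the number GDB reported, here is a small sketch that 
models the addition in 32-bit and in 64-bit width using `ctypes`. The 
per-side sizes (16 and 10 bytes per row) and the helper names are assumptions 
for illustration, and the sketch only mimics two's-complement wraparound; it 
is not the actual kernel code.

```python
import ctypes


def add_as_int32(a, b):
    """Add two sizes and keep only the low 32 bits, read as signed."""
    return ctypes.c_int32(a + b).value


def add_as_int64(a, b):
    """Add two sizes after widening to 64 bits, read as signed."""
    return ctypes.c_int64(a + b).value


n_rows = 10**8
left_size = n_rows * 16   # 1.6e9 bytes, fits in int32 on its own
right_size = n_rows * 10  # assumption: 10 bytes per row on the other side

print(add_as_int32(left_size, right_size))  # wraps to a negative value
print(add_as_int64(left_size, right_size))  # the intended sum
```

   If the two sizes were widened to int64_t before the addition, the sum 
would stay positive, which is what the second call models.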


