kccqzy commented on issue #49310: URL: https://github.com/apache/arrow/issues/49310#issuecomment-3942441002
Hey! I took another look at this issue, which I reported. I want to point out two observations:

* The original issue title, which says the result doesn’t fit in the string type, is inaccurate. Thinking about the `if_else` carefully, the result array should actually be identical to the `f` array. The `f` array was constructed successfully, so the result fits as well. Each string is 16 bytes and there are 100 million of them, so the data takes 1.6 GB of memory, well within the limit.
* The largest array size that works completely is 82595524. If one replaces the three occurrences of `10**8` with this number, there is no segfault. This number is exactly `2147483647 // 26`, which suggests that the code is somehow allocating space for the sum of the left side and the right side.

To be clear, it appears that this operation only succeeds without a segfault when the *concatenation* of the left and right sides fits within the string type. I took a quick look at the code; there is a comment “allocate data buffer conservatively”, and the code then computes an allocation size from this sum. GDB told me that the result of this sum is the negative value `-1694967296`, which is the correct sum of 2600000000 reinterpreted as a signed 32-bit integer. Since the code stores this in an `int64_t`, I wonder whether the intent was to first cast the left and right sizes to `int64_t` before doing the addition?
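The arithmetic in the comment can be checked with a short Python sketch. This is only an illustration of the numbers, not Arrow code: `wrap_int32` is a hypothetical helper, and the split of the 26 bytes per element into 16 bytes for `f` and 10 bytes for the other side is an assumption inferred from the 16-byte strings and the `// 26`.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit two's-complement value.

    Hypothetical helper that mimics what a 32-bit addition would produce.
    """
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


n = 10**8                 # number of elements in the reproduction
f_bytes = 16 * n          # 16-byte strings, as stated in the comment
other_bytes = 10 * n      # ASSUMPTION: the other side contributes 26 - 16 bytes per element

# Adding in 32 bits wraps around; adding in 64 bits (Python ints here) does not.
sum_32bit = wrap_int32(f_bytes + other_bytes)
sum_64bit = f_bytes + other_bytes

# Largest element count for which 26 bytes per element still fits in 32 bits.
largest_n = INT32_MAX // 26

print(sum_32bit, sum_64bit, largest_n)
```

If the sizes were widened to 64 bits before the addition, the sum would stay at 2600000000 instead of wrapping to a negative value, which is what the question about casting to `int64_t` is getting at.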
