pitrou commented on PR #41700: URL: https://github.com/apache/arrow/pull/41700#issuecomment-2535452180
By building on this (arguably simplified) analysis:

> we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say a 256-byte FSB) or the number of indices is much smaller than the number of values.

and assuming the following known values:

* `n_values`: length of the values input
* `n_indices`: length of the indices input (governing the output length)
* `value_width`: byte width of the individual values

then a simple heuristic could be to concatenate iff `n_indices * 16 > n_values * value_width`. This wouldn't take into account the larger computational cost associated with chunked indexing, but at least it would disable the chunked resolution approach when it doesn't make sense at all.

(btw, a moderate improvement could probably be achieved by using `CompressedChunkLocation`)
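The proposed heuristic could be sketched as a small predicate like the one below. This is only an illustration of the comment's inequality, not Arrow code: the function name `ShouldConcatenateValues` is hypothetical, and the constant `16` stands for the assumed per-index byte cost of chunked index resolution (roughly two new indices arrays, as described above).

```cpp
#include <cstdint>

// Hypothetical helper sketching the heuristic from the comment:
// concatenate the chunked values iff
//   n_indices * 16 > n_values * value_width.
// The constant 16 approximates the per-index cost (in bytes) of resolving
// chunked indices; value_width is the byte width of one value.
bool ShouldConcatenateValues(int64_t n_indices, int64_t n_values,
                             int64_t value_width) {
  return n_indices * 16 > n_values * value_width;
}
```

For example, with many indices into a few narrow values (`n_indices = 1000`, `n_values = 100`, `value_width = 4`) the predicate favors concatenation, while a few indices into many wide values (`n_indices = 10`, `n_values = 1000`, `value_width = 256`, i.e. a 256-byte FSB) keeps the chunked resolution path.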
