steveloughran commented on PR #55928: URL: https://github.com/apache/spark/pull/55928#issuecomment-4477357291
The iceberg impl. sorts all input, irrespective of size, and then binary searches. I'm wondering if that's a better approach. Here every lookup on a large variant with be, on average, take n/2 lookups. For `k` lookups, total cost of the operation is `k * n / 2`. Java's timsort sort averages 'O(n * lg(n))`, after which each binary search takes lg(n) operations. For `k` lookups, total cost is `k * lg(n) + n * lg(n)` For sorting + binary to be better than ``` k * n / 2 > (k + n) * lg (n) ``` Unless I've got my numbers wrong. You're going to need large value of k before it's worth sorting. special case, all values get looked up. The threshold is ``` n * n / 2 > n * 2 * lg(n) n / 2 = 2 * lg(n) n = 4 * lg(n) n / lg(n) = 4 ``` which can be solved for n = 16. So we could say "sort if length > 16", but again, that's assuming every value is resolved, which we don't know in advance. If only a few values are looked up, just linear scan them, as here, is the right thing to do. To conclude: I think this simple check is the right strategy, unless and until more data is collected on what variant structures actually get used in production systems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
