steveloughran commented on PR #55928:
URL: https://github.com/apache/spark/pull/55928#issuecomment-4477357291

   The iceberg impl. sorts all input, irrespective of size, and then binary 
searches. 
   
   I'm wondering if that's a better approach.
   
   Here every lookup on a large variant with be, on average, take n/2 lookups. 
For `k` lookups, total cost of the operation is `k * n / 2`.
   
   Java's timsort sort averages 'O(n * lg(n))`, after which each binary search 
takes lg(n) operations. For `k` lookups, total cost is `k * lg(n) + n * lg(n)`
   
   For sorting + binary to be better than
   ```
   k * n / 2 > (k + n) * lg (n)
   ```
   
   Unless I've got my numbers wrong. You're going to need large value of k 
before it's worth sorting.
   
   special case, all values get looked up. The threshold is
   ```
   n * n / 2 > n * 2 * lg(n)
   n / 2 = 2 * lg(n)
   n = 4 * lg(n)
   n / lg(n) = 4
   ```
   
   which can be solved for n = 16.
   
   So we could say "sort if length > 16", but again, that's assuming every 
value is resolved, which we don't know in advance.
   
   If only a few values are looked up, just linear scan them, as here, is the 
right thing to do.
   
   To conclude: I think this simple check is the right strategy, unless and 
until more data is collected on what variant structures actually get used in 
production systems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to