Re: [I] Improve performance of `array_has` [datafusion]

via GitHub Mon, 16 Feb 2026 08:08:33 -0800


neilconway commented on issue #18181:
URL: https://github.com/apache/datafusion/issues/18181#issuecomment-3909302532


   I spent a little while looking at this. Some observations:
   
   * I tried rewriting the repro script to pass a constant array as the second 
argument to `array_has_any`. This did not significantly improve performance.
   * Digging into `array_has_any`, it doesn't optimize for the case where one 
argument is a constant/scalar. That means we do N*M comparisons per row, rather 
than building a hash table on the fixed array and just doing a probe for each 
element of the input array. This is relatively easy to fix and seems worth 
doing regardless.
   * Unfortunately, this fix does not improve performance for the repro script, 
because the haystack argument for `array_has_any` is a subquery. The fact that 
the subquery is invariant does not seem to be preserved through to the UDF 
invocation: `array_has_any` is invoked with two array arguments, not an array 
and a scalar. It seems like this is a shortcoming in how we optimize subqueries?
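
To make the second observation concrete, here is a minimal sketch of the nested-comparison approach versus the build-once, probe-per-element approach. It is plain Rust over `i64` slices and is not DataFusion code: every function name below is hypothetical, and it ignores NULL handling and the Arrow array types that the real `array_has_any` operates on.

```rust
use std::collections::HashSet;

/// Baseline: compare every element of the row against every element of the
/// fixed array, i.e. N*M comparisons per row.
/// (Hypothetical helper, not a DataFusion function.)
fn has_any_nested(row: &[i64], fixed: &[i64]) -> bool {
    row.iter().any(|x| fixed.iter().any(|y| x == y))
}

/// Build the lookup structure once for the constant argument,
/// outside the per-row loop.
fn build_lookup(fixed: &[i64]) -> HashSet<i64> {
    fixed.iter().copied().collect()
}

/// Probe: one hash lookup per element of the row.
fn has_any_probe(row: &[i64], lookup: &HashSet<i64>) -> bool {
    row.iter().any(|x| lookup.contains(x))
}

/// Evaluate a batch of rows against a single constant array.
/// The hash set is built once and reused for every row.
fn has_any_batch(rows: &[Vec<i64>], fixed: &[i64]) -> Vec<bool> {
    let lookup = build_lookup(fixed);
    rows.iter().map(|row| has_any_probe(row, &lookup)).collect()
}
```

This only pays off when the function can tell that one argument is the same for every row, which is exactly what is lost in the subquery case described in the last bullet.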
   
   I don't know the optimizer at all, so I'd be curious to get feedback on 
whether this analysis is correct, and if so whether this is a known shortcoming 
of the optimizer or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

