neilconway commented on issue #18181: URL: https://github.com/apache/datafusion/issues/18181#issuecomment-3909302532
I spent a little while looking at this. Some observations:

* I tried rewriting the repro script to pass a constant array as the second argument to `array_has_any`. This did not significantly improve performance.
* Digging into `array_has_any`, it doesn't optimize for the case where one argument is a constant/scalar. That means we do N*M comparisons per row, rather than building a hash table on the fixed array and just doing a probe for each element of the input array. This is relatively easy to fix and seems worth doing regardless.
* Unfortunately, this fix does not improve performance for the repro script, because the haystack argument for `array_has_any` is a subquery. The fact that the subquery is invariant does not seem to be preserved through to the UDF invocation: `array_has_any` is invoked with two array arguments, not an array and a scalar. It seems like this is a shortcoming in how we optimize subqueries?

I don't know the optimizer at all, so I'd be curious to get feedback on whether this analysis is correct, and if so whether this is a known shortcoming of the optimizer or not.
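To make the second observation concrete, here is a minimal sketch of the difference between the N*M comparison and the build-once, probe-per-element approach. It is not DataFusion's actual code: plain `i64` slices stand in for Arrow list arrays, nulls are ignored, and the function names (`has_any_naive`, `build_needle_set`, `has_any_probe`) are hypothetical.

```rust
use std::collections::HashSet;

/// Naive check: compares each element of `row` against each element of
/// `needles`, so the cost per row is O(N * M).
fn has_any_naive(row: &[i64], needles: &[i64]) -> bool {
    row.iter().any(|x| needles.iter().any(|n| n == x))
}

/// Builds the lookup set once. This only pays off when the second
/// argument is known to be constant across all rows (assumption).
fn build_needle_set(needles: &[i64]) -> HashSet<i64> {
    needles.iter().copied().collect()
}

/// Probe check: one hash lookup per element of `row`, so the cost per
/// row is O(N) on average after the one-time build.
fn has_any_probe(row: &[i64], needle_set: &HashSet<i64>) -> bool {
    row.iter().any(|x| needle_set.contains(x))
}
```

A caller would invoke `build_needle_set` once per batch (or once per query) and then call `has_any_probe` for every row. This only helps if the function can tell that the argument is a scalar, which is the piece that appears to be missing when the argument comes from a subquery.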
