imply-cheddar commented on PR #12277:
URL: https://github.com/apache/druid/pull/12277#issuecomment-1258805916

   On the IN filter benchmarks, I've definitely seen client-generated queries 
with large IN filters.  Looker generates them sometimes, other tools generate 
them too, and there are definitely applications that generate them.  They are 
programmatically generated, yes, but they also come from external sources.  I 
forget whether we've fixed this yet, but it is common enough that we have 
(or had?) a known issue around SQL parsing of large IN filters, where parsing 
first puts them all into ORs before converting them back to IN, and our 
planning code for some reason likes to do `O(n^2)` passes over ORs.  Having a 
benchmark that also covers that case would be nice.
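
   To make the shape of that case concrete, here is a minimal sketch in Python 
(not Druid code).  The table name `foo`, the column `dim1`, and every helper 
below are hypothetical, and the pairwise scan is only a stand-in for whatever 
the planner actually does with the ORs; it just shows how work grows 
quadratically once a large IN has been expanded.

```python
def build_in_query(table, column, values):
    """Build a SQL string with a single IN filter (hypothetical helper)."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"SELECT COUNT(*) FROM {table} WHERE {column} IN ({quoted})"


def expand_in_to_ors(column, values):
    """Model the IN -> OR expansion as a flat list of equality predicates."""
    return [(column, v) for v in values]


def collapse_quadratic(predicates):
    """Collapse ORs back into distinct predicates with a pairwise scan.

    Assumed stand-in for an O(n^2) pass: every predicate is compared
    against every predicate kept so far.
    """
    kept = []
    comparisons = 0
    for pred in predicates:
        duplicate = False
        for seen in kept:
            comparisons += 1
            if seen == pred:
                duplicate = True
                break
        if not duplicate:
            kept.append(pred)
    return kept, comparisons


def collapse_linear(predicates):
    """Same result with a hash set, so expected work is O(n)."""
    seen = set()
    kept = []
    for pred in predicates:
        if pred not in seen:
            seen.add(pred)
            kept.append(pred)
    return kept


if __name__ == "__main__":
    values = [f"value_{i}" for i in range(1000)]
    sql = build_in_query("foo", "dim1", values)
    ors = expand_in_to_ors("dim1", values)
    kept, comparisons = collapse_quadratic(ors)
    print(len(sql), len(kept), comparisons)
```

   A benchmark along these lines would sweep the number of IN values and 
watch whether planning time grows faster than linearly.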
   
   When I look at the massive UNION ALL query in query 19, there is an actual 
difference in the timings for that query between GenericIndexed and the 
FrontCodedIndex.  I just wonder if that's because of the IN clause at the very 
end of that query...
   
   Just looked again at the `FrontCodedIndexBenchmark` and I see the 
`GenericIndexed` in there too.  I can't help but wonder if the differences 
there are actually from the fact that `FrontCoded` is dealing with UTF8 bytes 
instead of `String`.  I.e. the overhead of dealing with `String` is roughly 
equal to the overhead of the extra objects, and the two are just canceling 
each other out.  If that's the case, then there's even more benefit to be had 
from the `FrontCodedIndexBenchmark`.  The other thing, which is much more 
difficult to measure but which we have seen, is that garbage triggers GC, and 
GC stops *all* queries from making progress.  This can limit the overall level 
of concurrency of the distributed system and can be difficult to measure in a 
benchmark.
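
   As a rough illustration of the UTF8-bytes-versus-`String` point, here is a 
small Python sketch (again not Druid code, and not the front-coded format 
itself).  It looks up a value in a sorted dictionary of UTF-8 encoded entries 
two ways: comparing raw bytes, and decoding each probed entry into a string 
first.  All names are hypothetical, and JVM allocation costs will differ from 
Python's; the point is only where the temporary objects come from.

```python
import bisect


def build_dictionary(values):
    """Sorted, distinct UTF-8 encodings; a stand-in for a value dictionary."""
    return sorted({v.encode("utf-8") for v in values})


def index_of_bytes(dictionary, needle):
    """Binary search directly on bytes; probes decode nothing."""
    i = bisect.bisect_left(dictionary, needle)
    if i < len(dictionary) and dictionary[i] == needle:
        return i
    return -1


def index_of_decoding(dictionary, needle):
    """Binary search that decodes every probed entry to a str first.

    Each probe allocates a temporary string, which is the kind of
    short-lived garbage that a byte-level comparison avoids.
    """
    lo, hi = 0, len(dictionary)
    while lo < hi:
        mid = (lo + hi) // 2
        if dictionary[mid].decode("utf-8") < needle:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(dictionary) and dictionary[lo].decode("utf-8") == needle:
        return lo
    return -1


if __name__ == "__main__":
    dictionary = build_dictionary(["druid", "apache", "index", "générique"])
    print(index_of_bytes(dictionary, "index".encode("utf-8")))
    print(index_of_decoding(dictionary, "index"))
```

   Both lookups agree here because UTF-8 byte order matches code point order, 
which is how Python compares strings.  Java's `String.compareTo` compares 
UTF-16 code units instead, so the two orderings can differ there for 
characters outside the Basic Multilingual Plane.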


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

