I know somebody who is querying a very large table and has trouble with pushdown.
They are looking up values by primary key with a query like "select * from table where key in s". If s has a very small number of values, this turns into primary key access, but with more than just a few it becomes a scan.

The interesting situation to detect is where s consists of a few tightly clustered groups of keys. The ideal strategy there would be to scan each group separately. How this might be detected isn't clear to me, but it would make a massive difference to this kind of query.

Currently, the best alternative is to avoid this kind of query and build a data flow in which each cluster of keys flows into a separate query. That would be easier if a common table expression (CTE) query could be run without the optimizer trying to globally optimize it back into a single big scan.

Anyway, I have no concrete suggestions for making this work, but the need is there.

On Tue, Aug 24, 2021 at 4:39 AM luoc <[email protected]> wrote:

> Hello Guys,
> Will you use Drill to query Apache HBase? If so, what new feature would
> you like to see in HBase storage plugin? In addition, Drill supported the
> Apache Cassandra since 1.19.
> Absolutely… Could you tell me what your most common storage plugin (or
> data format) are? Thanks for your time.
>
>
> -- luoc
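
A minimal sketch of the "one query per cluster of keys" workaround described above, done on the client side. It assumes integer keys, a hypothetical `max_gap` threshold for deciding when two keys belong to the same cluster, and that the storage plugin can push a range predicate down as a bounded scan. The function names and the generated SQL shape are illustrative only, not anything Drill provides.

```python
from typing import Iterable, List, Tuple


def cluster_keys(keys: Iterable[int], max_gap: int) -> List[Tuple[int, int, List[int]]]:
    """Group keys into clusters of nearby values.

    A new cluster starts whenever the gap to the previous key exceeds
    max_gap (an assumed tuning knob). Returns (low, high, members) per cluster.
    """
    clusters: List[List[int]] = []
    for k in sorted(set(keys)):
        if clusters and k - clusters[-1][-1] <= max_gap:
            clusters[-1].append(k)   # close enough: extend the current cluster
        else:
            clusters.append([k])     # too far away: start a new cluster
    return [(c[0], c[-1], c) for c in clusters]


def range_queries(table: str, key_col: str, keys: Iterable[int], max_gap: int) -> List[str]:
    """Build one query per cluster: a bounded range plus the exact key list."""
    queries = []
    for low, high, members in cluster_keys(keys, max_gap):
        in_list = ", ".join(str(k) for k in members)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {key_col} BETWEEN {low} AND {high} "
            f"AND {key_col} IN ({in_list})"
        )
    return queries


if __name__ == "__main__":
    for q in range_queries("t", "k", [1, 2, 3, 100, 101, 5000], max_gap=10):
        print(q)
```

The BETWEEN bounds are what would limit each scan to its cluster; the IN list only filters out keys inside the range that were not requested. Real HBase row keys are byte strings, so the gap test would need a different notion of distance.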
