I think there are divergent definitions of “similar”, and it helps if you state why you want to detect similar queries.
For example, by one definition, queries are similar if they are the same modulo substituting constants. For these queries, you can share plans if you hoist literals[1]. Other queries are similar if they use the same underlying tables. (The goal here might be security or compliance.) Another definition of similarity is “can I satisfy query A using the results of query B”? Materialized view rewrite is interested in this. For some kinds of similarity, I would work on the AST (i.e. SqlNode). For other, deeper, kinds of similarity I would recommend working at the algebra level (i.e. RelNode). Then there is equivalence sets vs. similarity. I think you’re talking about equivalence sets. Q1 and Q2 are either in the same equivalence set or they are not. If they are, then f(Q1) = f(Q2), where f is some function, say a hash function or canonization function. Other kinds of similarity (e.g. [2]) might give a “distance” between queries, and queries are deemed equivalent iff that distance is 0. Julian [1] https://issues.apache.org/jira/browse/CALCITE-963 [2] https://en.wikipedia.org/wiki/Cosine_similarity <https://en.wikipedia.org/wiki/Cosine_similarity> > On Nov 30, 2015, at 11:08 AM, Nick Dimiduk <[email protected]> wrote: > > Hello. > > I have a pile of SQL queries. Is it possible to use Calcite to group these > queries into "similar" buckets? Perhaps via some hash function on a parsed > query? I see some databases' query planners provide this kind of > functionality, but mine apparently does not. > > Thanks, > Nick
