bigint conversion issue w/IN clauses

Josh Wills Tue, 28 Jul 2015 17:14:29 -0700

Hey devs,

Found a fun issue for you that I couldn't find anywhere else after some
googling.


Let's say I have a BIGINT column named "q" in a table, and I want to do
some filtering on a list of values from that column based on an IN clause.
If all of the values in my IN clause happen to be INTs, then my query will
work properly (i.e., it will return any rows from the table whose "q" value
contains one of the values from my list) if the IN clause contains less
than 20 values, but as soon as the IN clause contains more than 20 values,
the query will always return zero rows.

I did some explains and can see that the planner changes the strategy for
the query execution when the number of values in an IN clause exceeds 20--
less than 20 generates a big (q = 1 OR q = 2 OR ...) statement, whereas >=
20 switches to a SemiJoin between the original table and an
EnumerableValues object that contains the list of values from the IN
clause. I think the rub here is that if the values in the IN clause look
like INTs or at least one of them isn't explicitly cast to a BIGINT, then
the EnumerableValues will also have type int, and the semi-join will fail
b/c of the type difference (that last part is me hypothesizing, you all
would obviously know better what the problem is.)

I can get around the issue by either casting one of the values in the IN
clause to be a BIGINT, or by explicitly including a value in the list that
can't be stored as an INT, but it seemed like there should be a cleaner way
to handle this case so that the query behavior doesn't change dramatically
when one extra value gets added to the IN list.

Thanks!
Josh


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

bigint conversion issue w/IN clauses

Reply via email to