Hey devs, Found a fun issue for you that I couldn't find anywhere else after some googling.
Let's say I have a BIGINT column named "q" in a table, and I want to do some filtering on a list of values from that column based on an IN clause. If all of the values in my IN clause happen to be INTs, then my query will work properly (i.e., it will return any rows from the table whose "q" value contains one of the values from my list) if the IN clause contains less than 20 values, but as soon as the IN clause contains more than 20 values, the query will always return zero rows. I did some explains and can see that the planner changes the strategy for the query execution when the number of values in an IN clause exceeds 20-- less than 20 generates a big (q = 1 OR q = 2 OR ...) statement, whereas >= 20 switches to a SemiJoin between the original table and an EnumerableValues object that contains the list of values from the IN clause. I think the rub here is that if the values in the IN clause look like INTs or at least one of them isn't explicitly cast to a BIGINT, then the EnumerableValues will also have type int, and the semi-join will fail b/c of the type difference (that last part is me hypothesizing, you all would obviously know better what the problem is.) I can get around the issue by either casting one of the values in the IN clause to be a BIGINT, or by explicitly including a value in the list that can't be stored as an INT, but it seemed like there should be a cleaner way to handle this case so that the query behavior doesn't change dramatically when one extra value gets added to the IN list. Thanks! Josh -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
