Hi All, Say i have a table with below schema:
CREATE TABLE foo (id INT) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC; and when we issue the following query, its doing a "Full table scan" SELECT * FROM foo WHERE id=<some integer> After doing some searching on the net, i found that "table sample" seems to be one of the ways to resolve this and the query would be written in this manner: SELECT * FROM foo TABLE SAMPLE(BUCKET <bucket id of some_integer: say 2> of 8) where id=<some_integer> I was expecting the above query to read only the 2nd bucket but to my surprise it did full table scan again. I understand that partitioning in hive is the way to go for such queries by i have two points on why we need such filtering techniques for buckets as well. 1. Partitioning may not be suitable / preferred when there are lots of partitions (high cardinality columns) 2. Buckets just works out of the box, without specifying things like partitioning keys etc.. in the queries like "insert clauses". I was wondering if there are any technical constraints on why we were not able to restrict the scan only to that bucket for such pointed queries? -- Thanks, M. Varadharajan ------------------------------------------------ "Experience is what you get when you didn't get what you wanted" -By Prof. Randy Pausch in "The Last Lecture" My Journal :- http://varadharajan.in