Selecting data based on the clustered columns

Deepak A Thu, 16 Jul 2009 03:26:13 -0700

Hi,
I have the following table in Hive:
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
INTO 256 BUCKETS;


Since the data is hash partitioned based on the 'UserId' column, buckets
were created based on the hash value of 'UserId'.
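
To make that layout concrete, here is a minimal Python sketch of how a row might be assigned to one of the 256 buckets. It assumes 'UserId' is an INT whose hash is the value itself, and that the bucket is the non-negative hash modulo the bucket count. I have not confirmed that this is exactly what Hive does. The names `bucket_for` and `NUM_BUCKETS` are made up for this sketch and are not Hive APIs.

```python
NUM_BUCKETS = 256  # from INTO 256 BUCKETS in the table definition


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based bucket index for an integer UserId.

    Assumption: the hash of an INT is the value itself, masked to a
    non-negative 31-bit number before taking the modulus.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


print(bucket_for(1))    # UserId 1
print(bucket_for(257))  # lands in the same bucket as UserId 1 under this scheme
```

Under these assumptions, every row for a given 'UserId' ends up in exactly one bucket, which is why I would expect a lookup on a single 'UserId' to need only that bucket.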

Now, when I issue a SELECT query to fetch all the posts by a particular
'UserId' (say, SELECT count(Id) FROM Posts WHERE UserId = 1), does Hive scan
only the bucket that this 'UserId' hashes to? When I run this query,
I can see all the buckets being searched for the 'UserId'.

Moreover, I see that there is a way to sample the table based on the
buckets. Why can't Hive automatically figure out which bucket the 'UserId'
hashes to and search only that bucket?
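
For reference, the manual workaround I have in mind looks roughly like the Python sketch below: compute the bucket by hand and name it in a TABLESAMPLE clause. The helper `sampled_count_query` is hypothetical, and the bucket arithmetic assumes an INT 'UserId' that hashes to its own value. TABLESAMPLE numbers buckets from 1, so the 0-based index is shifted by one.

```python
def sampled_count_query(user_id: int, num_buckets: int = 256) -> str:
    """Build a HiveQL query that reads only the bucket holding user_id.

    Assumption: bucket index = non-negative integer hash modulo the
    bucket count, with the hash of an INT being the value itself.
    """
    bucket_index = (user_id & 0x7FFFFFFF) % num_buckets  # 0-based
    return (
        "SELECT count(Id) FROM Posts "
        f"TABLESAMPLE(BUCKET {bucket_index + 1} OUT OF {num_buckets} ON UserId) "
        f"WHERE UserId = {user_id}"
    )


print(sampled_count_query(1))
```

If the hashing assumption is wrong, the query would read the wrong bucket and return 0, so this is only a sketch of the pruning I would like Hive to do on its own.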

Can someone clarify this for me?

Thanks,
Deepak
