Hi Namit,

Thanks a lot for the update. Will do that for sure.

-Deepak

On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote:

> Right now, bucketing information is not used in many places; it is
> only used in sampling.
> For example, if your query was:
>
> Select .. From Posts(tablesample 1 out of 256) a;
>
> then only the first bucket will be scanned.
>
> Your query can be optimized, but currently it is not. Can you file a jira
> on that?
> It will help us prioritize this.
>
> -namit
>
> On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote:
>
> Hi,
>
> I have the following table in Hive:
> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
> INTO 256 BUCKETS;
>
> Since the data is hash partitioned on the 'UserId' column, buckets
> were created based on the hash value of 'UserId'.
>
> Now, when I issue a Select query to fetch all the posts by a particular
> 'UserId' (say, Select count(Id) from Posts where UserId=1), does it scan
> only the bucket that 'UserId' hashes to? When I run this query,
> I can see all the buckets being searched for the UserId.
>
> Moreover, I see that there is a way to sample the table based on the
> buckets. Why can't Hive automatically figure out the bucket that UserId
> hashes to and search only in that bucket?
>
> Can someone clarify this for me?
>
> Thanks,
> Deepak
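
For readers following the thread, the bucket lookup being asked about can be sketched in a few lines of Python. This is only an illustration, not Hive code: the helper names `bucket_for` and `buckets_to_scan` are hypothetical, and the hashing rule (an integer key hashes to its own value, is masked to a non-negative number, then taken modulo the bucket count) is an assumption about Hive's default bucketing that should be checked against the Hive version in use.

```python
# Sketch of bucket pruning for an equality filter on the bucketing column.
# Assumption: integer keys hash to their own value, as Java's Integer.hashCode does.

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE, used to clear the sign bit


def bucket_for(user_id: int, num_buckets: int = 256) -> int:
    """Return the 0-based bucket index a UserId would land in (assumed rule)."""
    hash_code = user_id  # assumed hash of an int column
    return (hash_code & INT_MAX) % num_buckets


def buckets_to_scan(user_id: int, num_buckets: int = 256, pruning: bool = False) -> list:
    """List the bucket indexes read for `WHERE UserId = user_id`.

    pruning=False mirrors the behaviour described in the thread (all buckets
    are scanned); pruning=True shows the optimization being requested.
    """
    if pruning:
        return [bucket_for(user_id, num_buckets)]
    return list(range(num_buckets))


if __name__ == "__main__":
    print(len(buckets_to_scan(1)))           # every bucket is read today
    print(buckets_to_scan(1, pruning=True))  # only the matching bucket
```

Under these assumptions, a filter such as UserId=1 would need to read a single bucket out of the 256, which is the saving the requested optimization would bring.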
