Hi Namit,

I checked JIRA for existing tickets on this and found that there are plans to support indexing on queries. This is being discussed at https://issues.apache.org/jira/browse/HIVE-417
Can you please check whether what we are discussing makes sense in this context, or whether it is orthogonal to it?

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <[email protected]> wrote:

> Hi Namit,
> Thanks a lot for the update.
> Will do that for sure.
>
> -Deepak
>
>
> On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote:
>
>> Right now, bucketing information is not used in a lot of places; it is
>> only used in sampling.
>> For example:
>>
>> If your query was:
>>
>> Select .. From Posts(tablesample 1 out of 256) a;
>>
>> Then only the first bucket will be scanned.
>>
>> Your query can be optimized, but currently it is not. Can you file a JIRA
>> on that?
>> It will help us prioritize this.
>>
>> -namit
>>
>> On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote:
>>
>> Hi,
>>
>> I have the following table in Hive:
>> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY
>> (PostDate) INTO 256 BUCKETS;
>>
>> Since the data is hash partitioned on the 'UserId' column, buckets
>> were created based on the hash value of 'UserId'.
>>
>> Now, when I issue a Select query to fetch all the posts by a particular
>> 'UserId' (say, Select count(Id) from Posts where UserId=1), does it scan
>> only the bucket to which that 'UserId' is hashed? When I run this query,
>> I can see all the buckets being searched for the UserId.
>>
>> Moreover, I see that there is a way to sample the table based on the
>> buckets. Why can't Hive automatically figure out the bucket to which
>> UserId is hashed and search only in that bucket?
>>
>> Can someone clarify this for me?
>>
>> Thanks,
>> Deepak
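
Note on the thread above: the optimization being asked for amounts to computing the bucket for the filtered key and scanning only that bucket. The Python sketch below is illustrative only and is not Hive's implementation. It assumes that UserId is an integer column, that the hash of an integer key is the integer itself, that the bucket is the non-negative hash modulo the bucket count, and that TABLESAMPLE bucket numbers are 1-based. The function names `bucket_index` and `sampled_query` are hypothetical.

```python
# Illustrative sketch only; not Hive's actual implementation.
NUM_BUCKETS = 256  # from the table definition quoted in the thread


def bucket_index(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based bucket for an integer key.

    Assumption: the hash of an integer key is the integer itself, and the
    bucket is the non-negative hash modulo the bucket count.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


def sampled_query(user_id: int, num_buckets: int = NUM_BUCKETS) -> str:
    """Build a query restricted to the one bucket that could hold user_id.

    Assumption: TABLESAMPLE bucket numbers are 1-based, hence the + 1.
    """
    bucket = bucket_index(user_id, num_buckets) + 1
    return (
        "SELECT count(Id) FROM Posts "
        f"TABLESAMPLE(BUCKET {bucket} OUT OF {num_buckets} ON UserId) a "
        f"WHERE UserId = {user_id}"
    )


if __name__ == "__main__":
    print(sampled_query(1))
```

Under these assumptions, a manually written TABLESAMPLE clause of this form is a possible workaround until the planner prunes buckets on its own. It is worth checking against a full scan first, since a mismatch in the assumed hash would silently drop rows.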
