That would be great.!Thanks a lot. -Deepak
On Fri, Jul 17, 2009 at 11:46 AM, Prasad Chakka <[email protected]>wrote: > Yeah, we know of this optimization and and will add it as we start > optimizing filter queries using indexes. > > ------------------------------ > *From: *Namit Jain <[email protected]> > *Reply-To: *<[email protected]> > *Date: *Thu, 16 Jul 2009 22:42:25 -0700 > *To: *<[email protected]> > *Subject: *Re: Selecting data based on the clustered columns > > > I am not sure if they are handling this. Let me talk to Prasad offline and > get back to you. > > > > On 7/16/09 9:49 PM, "Deepak A" <[email protected]> wrote: > > Hi Namit, > > I checked JIRA for any existing tickets on this and figured out that there > are plans to support indexing on queries. This is being discussed at > https://issues.apache.org/jira/browse/HIVE-417 > > Can you please check if what we are discussing makes sense in this content > or if it is orthogonal to this. > > -Deepak > > On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <[email protected]> wrote: > > Hi Namit, > > Thanks a lot on the update. > Will do that for sure. > > -Deepak > > > On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote: > > Right now, bucketing information is not used in a lot of places – it is > only used in sampling. > For eg: > > If your query was: > > Select .. From Posts(tablesample 1 out of 256) a; > > Then only the first bucket will be scanned. > > Your query can be optimized, but currently it is not. Can you file a jira > on that ? > It will help us prioritize this. > > > > -namit > > > > On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote: > > Hi, > > I have the following table in Hive > Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) > INTO 256 BUCKETS; > > Since the data is hash partitioned based on the 'UserId' column, buckets > were created based on the hash value of 'UserId'. > > Now, when I issue a Select query to fetch all the posts by a particular > 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan > only the bucket to which 'UserId' is hashed to?. But, when I run this query, > I could see all the buckets being searched for the UserId. > > Moreover, I see that's there is a way to sample the table based on the > buckets. Why can't hive automatically figure out the bucket to which UserId > is hashed to and search only in that bucket? > > Can someone clarify me on this? > > Thanks, > Deepak > > > > > >
