I am not sure if they are handling this. Let me talk to Prasad offline and get 
back to you.



On 7/16/09 9:49 PM, "Deepak A" <[email protected]> wrote:

Hi Namit,

I checked JIRA for any existing tickets on this and found that there are 
plans to support indexing on queries. This is being discussed at 
https://issues.apache.org/jira/browse/HIVE-417

Can you please check whether what we are discussing makes sense in this context, 
or whether it is orthogonal to it?

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <[email protected]> wrote:
Hi Namit,

Thanks a lot for the update.
Will do that for sure.

-Deepak


On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote:
Right now, bucketing information is not used in many places; it is only 
used in sampling.
For example:

If your query was:

Select .. From Posts TABLESAMPLE(BUCKET 1 OUT OF 256) a;

Then only the first bucket will be scanned.

Your query can be optimized, but currently it is not. Can you file a JIRA for 
that?
It will help us prioritize this.



-namit



On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote:

Hi,

I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) 
INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets were 
created based on the hash value of 'UserId'.
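
To make the bucket assignment concrete, here is a minimal Python sketch of how a row could be mapped to one of the 256 buckets. It is only an illustration, not Hive's actual code: the function name `bucket_for` is hypothetical, and it assumes that the hash of an integer key is the integer itself and that the sign bit is masked off before taking the modulus.

```python
NUM_BUCKETS = 256  # from the table definition: INTO 256 BUCKETS


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based bucket index for an integer clustering key.

    Assumption (not confirmed in this thread): the hash of an integer key
    is the integer itself, and the result is made non-negative by masking
    off the sign bit before taking the modulus.
    """
    hash_code = user_id  # assumed identity hash for integer keys
    return (hash_code & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    for uid in (1, 256, 257):
        print(uid, "->", bucket_for(uid))
```

Under these assumptions every row with the same 'UserId' lands in the same bucket, which is what makes scanning a single bucket possible in principle.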

Now, when I issue a Select query to fetch all the posts by a particular 'UserId' 
(say, Select count(Id) from Posts where UserId=1), does it scan only the bucket 
to which that 'UserId' is hashed? When I run this query, I can see all the 
buckets being searched for the UserId.

Moreover, I see that there is a way to sample the table based on the buckets. 
Why can't Hive automatically figure out the bucket to which the UserId is hashed 
and search only in that bucket?
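
The optimization being asked for can be sketched in Python as below. This is a toy in-memory model, not how Hive plans or runs queries: the helper names (`bucket_for`, `build_buckets`, `count_full_scan`, `count_pruned`) are hypothetical, rows are reduced to (Id, UserId) pairs, and the hash is assumed to be the integer value with the sign bit masked off.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int]  # (Id, UserId); other columns omitted for brevity


def bucket_for(user_id: int, num_buckets: int) -> int:
    # Assumed hashing scheme: identity hash for integers, sign bit masked.
    return (user_id & 0x7FFFFFFF) % num_buckets


def build_buckets(rows: List[Row], num_buckets: int) -> Dict[int, List[Row]]:
    """Distribute rows into buckets by the hash of UserId."""
    buckets: Dict[int, List[Row]] = {b: [] for b in range(num_buckets)}
    for row in rows:
        buckets[bucket_for(row[1], num_buckets)].append(row)
    return buckets


def count_full_scan(buckets: Dict[int, List[Row]], user_id: int) -> Tuple[int, int]:
    """Count matching rows by reading every bucket (the behaviour observed)."""
    count = sum(1 for rows in buckets.values() for _, uid in rows if uid == user_id)
    return count, len(buckets)  # (matches, buckets read)


def count_pruned(buckets: Dict[int, List[Row]], user_id: int,
                 num_buckets: int) -> Tuple[int, int]:
    """Count matching rows by reading only the bucket the key hashes to."""
    target = bucket_for(user_id, num_buckets)
    count = sum(1 for _, uid in buckets[target] if uid == user_id)
    return count, 1  # (matches, buckets read)


if __name__ == "__main__":
    posts = [(i, i % 1000) for i in range(10000)]
    buckets = build_buckets(posts, 256)
    print(count_full_scan(buckets, 1))
    print(count_pruned(buckets, 1, 256))
```

Both functions return the same count; the pruned version simply reads one bucket instead of all of them, which is safe only because an equality predicate on the clustering column pins the row to a single bucket.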

Can someone clarify this for me?

Thanks,
Deepak



