Re: Selecting data based on the clustered columns

Deepak A Thu, 16 Jul 2009 21:50:32 -0700

Hi Namit,
I checked JIRA for any existing tickets on this and figured out that there
are plans to support indexing on queries. This is being discussed at
https://issues.apache.org/jira/browse/HIVE-417


Can you please check whether what we are discussing makes sense in this
context, or whether it is orthogonal to it?
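
To make the idea concrete, here is a minimal sketch of the bucket lookup
being discussed. It assumes the bucket for a row is the non-negative hash of
the clustering column modulo the number of buckets, and that an integer key
hashes to its own value (as Java's Integer.hashCode does). The class and
method names are hypothetical and are not taken from the Hive code base; the
actual hash function Hive uses may differ.

```java
// Hypothetical illustration only: not Hive's actual implementation.
class BucketSketch {

    // Assumption: bucket index = (hash with the sign bit cleared) % bucket count.
    // The returned index is zero-based.
    static int bucketFor(Object clusterKey, int numBuckets) {
        return (clusterKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 256; // Posts is CLUSTERED BY (UserId) INTO 256 BUCKETS
        int userId = 1;       // the predicate "where UserId=1"

        int bucket = bucketFor(userId, numBuckets);
        System.out.println("UserId=" + userId + " maps to bucket index "
                + bucket + " of " + numBuckets);
    }
}
```

Under these assumptions, a query with an equality predicate on UserId would
only need to read the one bucket file that bucketFor returns, rather than all
256.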

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <[email protected]> wrote:

> Hi Namit,
> Thanks a lot for the update.
> Will do that for sure.
>
> -Deepak
>
>
> On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote:
>
>> Right now, bucketing information is not used in many places; it is
>> only used in sampling.
>> For example:
>>
>> If your query was:
>>
>> Select .. From Posts TABLESAMPLE(BUCKET 1 OUT OF 256) a;
>>
>> Then only the first bucket will be scanned.
>>
>> Your query can be optimized, but currently it is not. Can you file a JIRA
>> on that?
>> It will help us prioritize this.
>>
>>
>>
>> -namit
>>
>>
>>
>> On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote:
>>
>> Hi,
>>
>> I have the following table in Hive
>> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY
>> (PostDate) INTO 256 BUCKETS;
>>
>> Since the data is hash partitioned based on the 'UserId' column, buckets
>> were created based on the hash value of 'UserId'.
>>
>> Now, when I issue a Select query to fetch all the posts by a particular
>> 'UserId' (say, Select count(Id) from Posts where UserId=1), does it scan
>> only the bucket to which that 'UserId' is hashed? When I run this query,
>> I can see all the buckets being searched for the UserId.
>>
>> Moreover, I see that there is a way to sample the table based on the
>> buckets. Why can't Hive automatically figure out the bucket to which UserId
>> is hashed and search only that bucket?
>>
>> Can someone clarify this for me?
>>
>> Thanks,
>> Deepak
>>
>>
>
