Re: Selecting data based on the clustered columns

Deepak A Thu, 16 Jul 2009 09:57:05 -0700

Hi Namit,
Thanks a lot for the update.
Will do that for sure.

-Deepak


On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <[email protected]> wrote:

>  Right now, bucketing information is not used in a lot of places – it is
> only used in sampling.
> For example:
>
> If your query was:
>
> SELECT .. FROM Posts TABLESAMPLE(BUCKET 1 OUT OF 256) a;
>
> Then only the first bucket will be scanned.
>
> Your query can be optimized, but currently it is not. Can you file a JIRA
> on that?
> It will help us prioritize this.
>
>
>
> -namit
>
>
>
> On 7/16/09 3:25 AM, "Deepak A" <[email protected]> wrote:
>
> Hi,
>
> I have the following table in Hive
> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
> INTO 256 BUCKETS;
>
> Since the data is hash partitioned based on the 'UserId' column, buckets
> were created based on the hash value of 'UserId'.
>
> Now, when I issue a Select query to fetch all the posts by a particular
> 'UserId' (say, Select count(Id) from Posts where UserId=1), does it scan
> only the bucket to which that 'UserId' is hashed? However, when I run this
> query, I can see all the buckets being searched for the UserId.
>
> Moreover, I see that there is a way to sample the table based on the
> buckets. Why can't Hive automatically figure out the bucket to which UserId
> is hashed and search only in that bucket?
>
> Can someone clarify this for me?
>
> Thanks,
> Deepak
>
>
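
For illustration, the optimization discussed in this thread (reading only the
bucket that an equality predicate on the clustering column can match) can be
sketched roughly as below. This is not Hive's implementation. The function
names bucket_for and buckets_to_scan are hypothetical, and the hash rule (an
integer key hashes to its own value, and the bucket index is that value masked
to be non-negative, modulo the bucket count) is an assumption made for the
sketch. Only the bucket count of 256 comes from the table definition above.

```python
# Sketch of bucket pruning for an equality predicate on the clustering column.
# Assumptions (not stated in the thread): an integer key hashes to its own
# value, and the bucket index is (hash & 0x7FFFFFFF) % num_buckets.

NUM_BUCKETS = 256  # from the table definition: INTO 256 BUCKETS


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Return the 0-based index of the bucket that would hold user_id."""
    hash_code = user_id  # assumed hash rule for integer keys
    # Mask to a non-negative 31-bit value before taking the modulus.
    return (hash_code & 0x7FFFFFFF) % num_buckets


def buckets_to_scan(user_id=None, num_buckets=NUM_BUCKETS):
    """Return the bucket indexes a scan would have to read.

    With no predicate every bucket is read, which matches the behaviour
    reported in the thread. With an equality predicate on the clustering
    column, a pruning planner would read a single bucket.
    """
    if user_id is None:
        return list(range(num_buckets))
    return [bucket_for(user_id, num_buckets)]


if __name__ == "__main__":
    # WHERE UserId = 1: one bucket instead of all 256.
    print(buckets_to_scan(user_id=1))
    print(len(buckets_to_scan()))
```

The sketch uses 0-based bucket indexes, whereas the sampling example in the
reply counts buckets from 1, so an index returned here would need to be
shifted by one before being used in a sampling clause.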
