Hi Ryan,

We never tried multi-column buckets, so I am not sure.
You will get better performance when you do "SELECT ... FROM mytable
TABLESAMPLE(BUCKET 1 OUT OF 256);". Hive will submit only one of the 256
files in the table to the map-reduce job.

Zheng

On Wed, Oct 28, 2009 at 3:19 PM, Ryan LeCompte <[email protected]> wrote:

> Zheng,
>
> Thanks for your response.
>
> Another question: in the example on that wiki page, there is:
>
> CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
> COMMENT 'A bucketed copy of user_info'
> PARTITIONED BY(ds STRING)
> CLUSTERED BY(userid) INTO 256 BUCKETS;
>
> Is it possible to specify more than one key in the CLUSTERED BY(...)
> clause?
>
> Also, if I am clustering my tables, where/when would I expect to get
> improved performance in Hive queries?
>
> Thanks,
> Ryan
>
> On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:
>
>> I think Hive internally uses the file name to guess which bucket the
>> file should belong to.
>>
>> If your data have the same number of files as the table has buckets, and
>> the files are named "part-00000", "part-00001", ..., then it will work to
>> just load the files into the table.
>>
>> This also requires you to know the hashcode of the "bucket key". If the
>> "bucket key" is an integer/long, the row with key "x" should belong to
>> bucket "x % BUCKETS", so make sure you put that row into "part-0000<y>",
>> where y = "x % BUCKETS".
>>
>> Zheng
>>
>> On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am trying to create a table that is bucketed and sorted by various
>>> columns. My table is created as a sequence file, and I'm populating it
>>> with the LOAD DATA command. However, I just came across this wiki page
>>> (http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables),
>>> which says that the data will NOT be bucketed when inserted into the
>>> table. It gives an example of using the CLUSTER BY clause in a SELECT
>>> statement to insert the data into the table.
>>>
>>> Is it possible to somehow get the same effect by using the LOAD DATA
>>> command? Or do I have to create a separate bucketed and non-bucketed
>>> table for my data and move it around like the example in the link above
>>> indicates?
>>>
>>> Thanks,
>>> Ryan
>>
>> --
>> Yours,
>> Zheng

--
Yours,
Zheng
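[Editorial note appended to the thread] On the LOAD DATA question: the wiki page Ryan links recommends writing into the bucketed table through a query, so that the rows actually get hashed into buckets. A sketch of that pattern for the table in the thread, assuming a hypothetical staging table `user_info_staging` with the same columns (populated via LOAD DATA) and a placeholder partition value; the exact statement is on the wiki page:

```sql
-- Make the reducer count match the bucket count, so each reducer
-- writes exactly one bucket file (the approach used in that era of Hive).
set mapred.reduce.tasks = 256;

-- user_info_staging is a hypothetical unbucketed table loaded via LOAD DATA.
FROM user_info_staging
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds = '2009-10-28')
SELECT userid, firstname, lastname
CLUSTER BY userid;
```

CLUSTER BY hashes the rows on `userid` across the reducers, which is what makes the resulting files real buckets rather than arbitrary file splits.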
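The manual bucketing rule in Zheng's earlier reply (the row with integer key x goes to file "part-0000<y>" where y = x % BUCKETS, since Hive's hash of an integer/long is the value itself) can be sketched as follows. This is a minimal illustration with made-up keys and a small bucket count, not Hive's actual code:

```python
# Illustration of manual bucket-file assignment for integer bucket keys.
# The modulo rule and file-name pattern follow the thread; BUCKETS is
# kept small for readability (the thread uses 256).
BUCKETS = 4

def bucket_file(key: int, buckets: int = BUCKETS) -> str:
    """Return the part-file a row with this key must be written to."""
    y = key % buckets  # Hive's hash of an integer/long is the value itself
    return "part-%05d" % y

# Group some sample rows into their bucket files.
rows = [0, 1, 5, 6, 256, 257]
files = {}
for key in rows:
    files.setdefault(bucket_file(key), []).append(key)

# TABLESAMPLE(BUCKET 1 OUT OF 4) would read only the first file,
# i.e. the rows where key % 4 == 0.
print(files["part-00000"])  # -> [0, 256]
```

If the files loaded with LOAD DATA were not written with exactly this placement, queries that rely on the bucketing (sampling, bucketed joins) can silently return wrong results, which is why the wiki recommends the CLUSTER BY insert instead.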
