Zheng,

Thanks for your response.


Another question:

In the example on that wiki page, there is:

CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING,
lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid) INTO 256 BUCKETS;



Is it possible to specify more than one key in the CLUSTERED BY(...) clause?


Also, if I am clustering my tables, where/when would I expect to get
improved performance in Hive queries?

Thanks,
Ryan

On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:

> I think Hive internally uses the file name to guess which bucket the
> file should belong to.
>
> If your data has the same number of files as the table has buckets, and the
> files are named "part-00000", "part-00001", ..., then it will work by just
> loading the files into the table.
>
> This also requires you to know the hashcode of the "bucket key". If the
> "bucket key" is an integer/long, the row with key "x" should belong to
> bucket "x % BUCKETS", so make sure you put that row into "part-0000<y>"
> where y = "x % BUCKETS".
>
> Zheng
>
>
> On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to create a table that is bucketed and sorted by various
>> columns. My table is created as a sequence file, and I'm populating it with
>> the LOAD DATA command. However, I just came across this wiki page (
>> http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables)
>> which says that the data will NOT be bucketed when inserted into the table.
>> It gives an example of using the CLUSTER BY command in a SELECT statement to
>> insert the data into the table.
>>
>> Is it possible to somehow get the same effect by using the LOAD DATA
>> command? Or do I have to create a separate bucketed and non-bucketed table
>> for my data and move it around like the example in the link above indicates?
>>
>> Thanks,
>> Ryan
>>
>>
>
>
> --
> Yours,
> Zheng
>
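The bucket-to-file mapping Zheng describes above (an integer key x lands in bucket x % BUCKETS, stored in the file "part-0000<y>") can be sketched like this. This is only an illustration of the arithmetic from the email, not Hive code; the helper name is made up, and 256 buckets matches the CREATE TABLE example:

```python
# Sketch of the mapping from Zheng's email: for an integer/long bucket
# key x and a table with N buckets, the row belongs to bucket x % N,
# which Hive expects to find in the Hadoop-style file "part-0000<y>".
# The function name is hypothetical, for illustration only.

NUM_BUCKETS = 256  # matches CLUSTERED BY(userid) INTO 256 BUCKETS


def bucket_file_for_key(x, num_buckets=NUM_BUCKETS):
    """Return the part-file name a row with integer key x should land in."""
    y = x % num_buckets            # bucket number for an integer/long key
    return "part-%05d" % y         # zero-padded file name, e.g. part-00232


# e.g. a row with userid 1000 belongs in bucket 1000 % 256 = 232,
# so it should be placed in the file "part-00232" before LOAD DATA.
```

So to make LOAD DATA work with a pre-bucketed table, each input file must already contain exactly the rows whose keys hash to that file's bucket number.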
