I think Hive internally uses the file name to guess which bucket a file
belongs to.

If your data has the same number of files as the table has buckets, and the
files are named "part-00000", "part-00001", ..., then simply loading the
files into the table will work.

This also requires you to know the hash code of the "bucket key". If the
"bucket key" is an integer/long, the row with key "x" belongs to bucket
"x % BUCKETS", so make sure you put that row into "part-0000<y>" where
y = x % BUCKETS.
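As a rough sketch of the idea above (plain Python, not Hive code; the row
data and NUM_BUCKETS value are made up, and this assumes an integer key
whose hash is the value itself, as in the integer/long case described):

```python
# Sketch: manually pre-bucket rows into part files before LOAD DATA.
# Assumes an integer bucket key, so a row with key x lands in
# bucket x % NUM_BUCKETS (see above).
NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_file(key: int, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the part-file name a row with this key should go into."""
    y = key % num_buckets
    return f"part-{y:05d}"

# Route each (key, value) row to its bucket file.
rows = [(0, "a"), (1, "b"), (5, "c"), (6, "d")]
buckets: dict[str, list] = {}
for key, value in rows:
    buckets.setdefault(bucket_file(key), []).append((key, value))
```

With NUM_BUCKETS = 4, keys 1 and 5 both hash to bucket 1, so both rows
would need to end up in "part-00001" for Hive's bucket pruning to work.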

Zheng

On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:

> Hello,
>
> I am trying to create a table that is bucketed and sorted by various
> columns. My table is created as a sequence file, and I'm populating it with
> the LOAD DATA command. However, I just came across this wiki page (
> http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables)
> which says that the data will NOT be bucketed when inserted into the table.
> It gives an example of using the CLUSTER BY command in a SELECT statement to
> insert the data into the table.
>
> Is it possible to somehow get the same effect by using the LOAD DATA
> command? Or do I have to create a separate bucketed and non-bucketed table
> for my data and move it around like the example in the link above indicates?
>
> Thanks,
> Ryan
>
>


-- 
Yours,
Zheng
