Map-side aggregation and map-side joins on the clustered keys should perform 
better?
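A hedged sketch of what that might look like (the bucket map join setting and the second table name here are assumptions for illustration, not from this thread):

```sql
-- Sketch: a bucket map join between two tables CLUSTERED BY(userid)
-- into the same number of buckets. user_stats_bucketed is hypothetical.
set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(b) */ a.userid, a.firstname, b.login_count
FROM user_info_bucketed a
JOIN user_stats_bucketed b ON (a.userid = b.userid);
```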

________________________________
From: Zheng Shao <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 28 Oct 2009 15:25:37 -0700
To: <[email protected]>
Subject: Re: Creating and populating bucketed tables

Hi Ryan,

We never tried multi-column buckets, so I am not sure.

You will get better performance when you do "SELECT ... FROM mytable 
TABLESAMPLE(BUCKET 1 OUT OF 256);". Hive will submit only one file out of the 
256 files from the table to the map-reduce job.

Zheng

On Wed, Oct 28, 2009 at 3:19 PM, Ryan LeCompte <[email protected]> wrote:
Zheng,

Thanks for your response.


Another question:

In the example on that wiki page, there is:

CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid) INTO 256 BUCKETS;


Is it possible to specify more than one key in the CLUSTERED BY(...) clause?
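Something like the following, if the syntax allows it (a hypothetical sketch, not a tested statement):

```sql
-- Hypothetical: clustering on two columns, if multi-column
-- CLUSTERED BY is supported.
CREATE TABLE user_info_bucketed2(userid BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid, firstname) INTO 256 BUCKETS;
```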

Also, if I am clustering my tables, where/when would I expect to get improved 
performance in Hive queries?

Thanks,
Ryan


On Sat, Oct 24, 2009 at 6:56 PM, Zheng Shao <[email protected]> wrote:
I think Hive internally uses the file name to guess which bucket the file 
should belong to.

If your data have the same number of files as buckets of the data, and the 
files are named "part-00000", "part-00001", .., then it will work by just 
loading the files into the table.

This also requires you to know the hashcode of the "bucket key". If the "bucket 
key" is an integer/long, the row with key "x" should belong to bucket "x % 
BUCKETS", so make sure you put that row into "part-0000<y>" where y = "x % 
BUCKETS".
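A worked example of that bucket arithmetic (assuming 256 buckets, as in the DDL earlier in the thread):

```sql
-- For an integer/long bucket key, the hashcode is the value itself,
-- so bucket y = x % BUCKETS. With 256 buckets, the row with key 1000
-- belongs to bucket 1000 % 256 = 232, i.e. the file "part-00232"
-- under the naming scheme above.
SELECT 1000 % 256;
```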

Zheng


On Sat, Oct 24, 2009 at 3:00 AM, Ryan LeCompte <[email protected]> wrote:
Hello,

I am trying to create a table that is bucketed and sorted by various columns. 
My table is created as a sequence file, and I'm populating it with the LOAD 
DATA command. However, I just came across this wiki page 
(http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables) which 
says that the data will NOT be bucketed when inserted into the table. It gives 
an example of using the CLUSTER BY command in a SELECT statement to insert the 
data into the table.
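For reference, the wiki's pattern is roughly the following (a sketch only; the unbucketed source table user_info and the partition value are assumptions, and mapred.reduce.tasks is set to match the bucket count so each reducer writes one bucket file):

```sql
-- Sketch of the wiki's CLUSTER BY approach: populate the bucketed
-- table from an unbucketed source so Hive hashes the rows itself.
set mapred.reduce.tasks = 256;

FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds = '2009-10-24')
SELECT userid, firstname, lastname
WHERE ds = '2009-10-24'
CLUSTER BY userid;
```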

Is it possible to somehow get the same effect by using the LOAD DATA command? 
Or do I have to create a separate bucketed and non-bucketed table for my data 
and move it around like the example in the link above indicates?

Thanks,
Ryan




--
Yours,
Zheng




--
Yours,
Zheng
