Hey Jeremy,

Hive stores each "table" inside of HDFS in a folder. For example, all of
your weblogs could be stored in a folder called "/hive/weblogs". If you want
to partition those weblogs by day, you can use the PARTITIONED BY clause on
the CREATE TABLE statement to create a subfolder for each new day, e.g.
"/hive/weblogs/ds=2009-01-08". If you want to further divide a day's
logfiles by userid, for example, Hive can hash-partition them into
"buckets" (files) inside that day's folder, e.g.
"/hive/weblogs/ds=2009-01-08/0001", where 0001 is the name of the bucket. To
indicate your desire to have buckets, use the CLUSTERED BY clause on the
CREATE TABLE statement (see
http://wiki.apache.org/hadoop/Hive/HiveQL#head-6fb42f2747383d4375e56cc31bbae68860c88a3d
).
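
As a rough sketch of what such a statement might look like: the column
names, the tab delimiter, and the bucket count of 32 are all made up for
illustration, and the folder the table actually lands in depends on how your
warehouse directory is configured.

```sql
-- Hypothetical weblogs table: one subfolder per day (ds),
-- with each day's rows hashed on userid into 32 buckets.
CREATE TABLE weblogs (
  userid   BIGINT,
  url      STRING,
  referrer STRING,
  ip       STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

Note that ds is declared only in PARTITIONED BY, not in the column list;
Hive exposes it as a column whose value comes from the folder name.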

You can also use buckets with the TABLESAMPLE operator to run Hive queries
over subsets of your data; this is useful for rapidly prototyping new
analyses. See
http://wiki.apache.org/hadoop/Hive/HiveQL#head-c7c5e4391816048d290eb70091487b4f91beebc9 for
the TABLESAMPLE syntax.
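
For instance, a sampling query might look roughly like this. It assumes a
weblogs table partitioned by ds and clustered by userid into 32 buckets; the
table, the columns, and the bucket count are hypothetical, so adjust them to
your own schema.

```sql
-- Look at roughly 1/32 of one day's rows by sampling a single bucket.
-- "s" is just an alias for the sampled table.
SELECT s.userid, s.url
FROM weblogs TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s
WHERE s.ds = '2009-01-08';
```

As I understand it, sampling on the same column the table is clustered by
lets Hive read only the matching bucket instead of scanning the whole
partition, which is where the speedup for prototyping comes from.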

Hive folks: in case I butchered that, feel free to jump in with a more
correct explanation. If it's correct, I'll toss it on the wiki. It would be
good to have actual HiveQL statements using buckets on the getting started
guide too, I'd imagine.

Later,
Jeff

On Thu, Jan 8, 2009 at 12:21 AM, Jeremy Chow <[email protected]> wrote:

> Hi list,
>
> I came across the term "bucket" while reading the Hive source code. What does it mean?
>
> Thanks,
> Jeremy
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
>
