Hey Jeremy,

Hive stores each "table" in its own folder in HDFS. For example, all of your weblogs could be stored in a folder called "/hive/weblogs". If you want to partition those weblogs by day, you can use the PARTITIONED BY clause on the CREATE TABLE statement to create a subfolder for each new day, e.g. "/hive/weblogs/ds=2009-01-08". If you want to further divide a day's logfiles by userid, for example, Hive can hash partition them into "buckets" (subfolders) inside that day's folder, e.g. "/hive/weblogs/ds=2009-01-08/0001", where 0001 is the name of the bucket. To ask for buckets, use the CLUSTERED BY clause on the CREATE TABLE statement (see http://wiki.apache.org/hadoop/Hive/HiveQL#head-6fb42f2747383d4375e56cc31bbae68860c88a3d ).
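
To make the folder layout concrete, here is a rough Python sketch of how a row might map to a bucket folder. The function names, the bucket count of 32, and the plain modulo hash are my assumptions for illustration only; Hive's actual hash function and bucket naming may differ.

```python
# Sketch only: mimics the folder layout described above, not Hive's real code.

def bucket_for(userid: int, num_buckets: int) -> int:
    """Assumed hash: plain modulo on a non-negative integer userid."""
    return userid % num_buckets


def bucket_path(table_dir: str, ds: str, userid: int, num_buckets: int) -> str:
    """Build table folder / day partition / bucket, as in the example paths."""
    bucket = bucket_for(userid, num_buckets)
    # Zero-pad the bucket number to four digits to match the "0001" naming above.
    return f"{table_dir}/ds={ds}/{bucket:04d}"


if __name__ == "__main__":
    # A hypothetical userid of 1025 with 32 buckets lands in bucket 1.
    print(bucket_path("/hive/weblogs", "2009-01-08", 1025, 32))
```

Every row with the same userid ends up in the same bucket for a given day, which is what makes it possible to read a single bucket and get a consistent slice of users.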

You can also use buckets with the TABLESAMPLE operator to run Hive queries over subsets of your data; this is useful for rapidly prototyping new analyses. See http://wiki.apache.org/hadoop/Hive/HiveQL#head-c7c5e4391816048d290eb70091487b4f91beebc9 for the TABLESAMPLE syntax.

Hive folks: in case I butchered that, feel free to jump in with a more correct explanation. If it's correct, I'll toss it on the wiki. It would be good to have actual HiveQL statements using buckets on the getting started guide too, I'd imagine.

Later,
Jeff

On Thu, Jan 8, 2009 at 12:21 AM, Jeremy Chow <[email protected]> wrote:
> Hi list,
>
> I get a term named bucket when reading hive source code. what is it means?
>
> Thanks,
> Jeremy
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
