Does that mean the same thing as hash partitioning?

On Thu, Jan 8, 2009 at 4:52 PM, Jeff Hammerbacher <[email protected]> wrote:
> Hey Jeremy,
>
> Hive stores each "table" inside of HDFS in a folder. For example, all of
> your weblogs could be stored in a folder called "/hive/weblogs". If you want
> to partition those weblogs by day, you can use the PARTITIONED BY clause on
> the CREATE TABLE statement to create a subfolder for each new day, e.g.
> "/hive/weblogs/ds=2009-01-08". If you wanted to further partition a day's
> logfiles by userid, for example, Hive can hash partition your logfiles into
> "buckets" (subfolders) inside that day's folder, e.g.
> "/hive/weblogs/ds=2009-01-08/0001", where 0001 is the name of the bucket. To
> indicate your desire to have buckets, use the CLUSTERED BY clause on the
> CREATE TABLE statement (see
> http://wiki.apache.org/hadoop/Hive/HiveQL#head-6fb42f2747383d4375e56cc31bbae68860c88a3d
> ).
>
> You can also use buckets with the TABLESAMPLE operator to run Hive queries
> over subsets of your data; this is useful for rapidly prototyping new
> analyses. See
> http://wiki.apache.org/hadoop/Hive/HiveQL#head-c7c5e4391816048d290eb70091487b4f91beebc9
> for the TABLESAMPLE syntax.
>
> Hive folks: in case I butchered that, feel free to jump in with a more
> correct explanation. If it's correct, I'll toss it on the wiki. It would be
> good to have actual HiveQL statements using buckets on the getting started
> guide too, I'd imagine.
>
> Later,
> Jeff
>
>
> On Thu, Jan 8, 2009 at 12:21 AM, Jeremy Chow <[email protected]> wrote:
>
>> Hi list,
>>
>> I came across the term "bucket" while reading the Hive source code. What
>> does it mean?
>>
>> Thanks,
>> Jeremy
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> http://coderplay.javaeye.com
>>
>
>

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com
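
P.S. To make the comparison with hash partitioning concrete, here is a minimal Python sketch of the bucketing idea described in the quoted message: each row is assigned to one of a fixed number of buckets by hashing the clustering column (userid). The bucket count, the sample rows, and the names bucket_for and bucket_rows are hypothetical, and the hash used here (the integer id itself, masked to be non-negative) is an assumption for illustration, not necessarily what Hive does internally.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_for(userid: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket index for a user id.

    Assumption: the hash of an integer id is the id itself, masked to be
    non-negative, then reduced modulo the number of buckets.
    """
    return (userid & 0x7FFFFFFF) % num_buckets


def bucket_rows(rows, num_buckets: int = NUM_BUCKETS):
    """Group rows by the bucket index of their 'userid' field."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row["userid"], num_buckets)].append(row)
    return dict(buckets)


# Made-up rows standing in for one day's weblogs.
rows = [{"userid": uid, "url": f"/page/{uid}"} for uid in (1, 2, 5, 6, 9)]
buckets = bucket_rows(rows)

# Reading a single bucket gives a subset of the data, which is the idea
# behind sampling by bucket.
sample = buckets[1]
```

Every row for a given userid lands in the same bucket, so reading one bucket yields a consistent slice of users rather than a random slice of rows.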
