[Hadoop Wiki] Update of "Hive/LanguageManual/DDL/Bucketed Tables" by PaulYang

Apache Wiki Thu, 01 Apr 2010 16:14:16 -0700

The "Hive/LanguageManual/DDL/BucketedTables" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables?action=diff&rev1=7&rev2=8

--------------------------------------------------

  
  First there’s table creation:
  {{{
- CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
+ CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  COMMENT 'A bucketed copy of user_info'
  PARTITIONED BY(ds STRING)
- CLUSTERED BY(userid) INTO 256 BUCKETS;
+ CLUSTERED BY(user_id) INTO 256 BUCKETS;
  }}}
  
- Then we populate this, making sure to use 256 reducers:
+ Note that we specify a column (user_id) on which to base the bucketing.
+ 
+ Then we populate the table:
  {{{
+ set hive.enforce.bucketing = true;
+ FROM user_info
- set mapred.reduce.tasks = 256;    
- FROM (
-     FROM user_info u
-     SELECT CAST(userid AS BIGINT) % 256 AS bucket_id, userid, firstname, lastname
-     WHERE d.ds='2009-02-25'
-     CLUSTER BY bucket_id
-     ) c
  INSERT OVERWRITE TABLE user_info_bucketed
  PARTITION (ds='2009-02-25')
- SELECT userid, firstname, lastname;
+ SELECT userid, firstname, lastname WHERE ds='2009-02-25';
  }}}
  
- Note that I’m clustering by the integer version of userid.  This might 
otherwise cluster by userid as a STRING (depending on the type of userid in 
user_info), which uses a totally different hash.  It's important for the 
hashing function to be of the correct data type, since otherwise we'll expect 
userids in bucket 1 to satisfy (big_hash(userid) mod 256 == 0), but instead 
we'll be getting (string_hash(userid) mod 256 == 0).  It's also good form to 
have all of your tables use the same type (eg, BIGINT instead of STRING) since 
that way your sampling from multiple tables will give you the same userids, 
letting join efficiently sample and join.
+ The command {{{set hive.enforce.bucketing = true;}}} lets Hive automatically 
select the correct number of reducers and the cluster-by column based on the 
table definition. Otherwise, you would need to set the number of reducers to 
match the number of buckets, as in {{{set mapred.reduce.tasks = 256;}}}, and 
add a {{{CLUSTER BY ...}}} clause to the select.
  
+ How does Hive distribute the rows across the buckets? In general, the bucket 
number is determined by the expression {{{hash_function(bucketing_column) mod 
num_buckets}}}. (There's a 0x7FFFFFFF in there too, but that's not that 
important.) The hash_function depends on the type of the bucketing column. For 
an int, it's easy: {{{hash_int(i) == i}}}. For example, if user_id were an int 
and there were 10 buckets, we would expect all user_ids that end in 0 to be in 
bucket 1, all user_ids that end in 1 to be in bucket 2, etc. For other 
datatypes, it's a little trickier. In particular, the hash of a BIGINT is not 
the same as the BIGINT. And the hash of a string or a complex datatype will be 
some number that's derived from the value, but not anything humanly 
recognizable. For example, if user_id were a STRING, then the user_ids in 
bucket 1 would probably not end in 0. In general, though, distributing rows 
based on the hash will give you an even distribution in the buckets.
+ 
+ So, what can go wrong? As long as you {{{set hive.enforce.bucketing = 
true}}} and use the syntax above, the tables should be populated properly. 
Things can go wrong if the bucketing column type is different during the insert 
and on read, or if you manually cluster by a value that's different from the 
table definition.
+
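
The bucket arithmetic described in the change above can be sketched in Java. This is an illustration only, not Hive's actual code: the class name BucketSketch and its helper names are hypothetical, and Java's own Long and String hash rules are used as assumed stand-ins for the per-type hash functions. The helper returns a zero-based index, so index 0 corresponds to what the text calls bucket 1.

```java
// Illustrative sketch only; names are hypothetical and the per-type hashes
// are assumptions (Java's built-in rules), not confirmed Hive internals.
final class BucketSketch {
    // INT column: the hash is the value itself, as the text states.
    public static int hashInt(int v) {
        return v;
    }

    // BIGINT column (assumed): fold the high 32 bits into the low 32 bits,
    // which is the same rule Java uses for Long.hashCode.
    public static int hashBigint(long v) {
        return (int) (v ^ (v >>> 32));
    }

    // STRING column (assumed): Java's String.hashCode as a stand-in.
    public static int hashString(String s) {
        return s.hashCode();
    }

    // Clear the sign bit with 0x7FFFFFFF so the result is never negative,
    // then take the remainder by the number of buckets.
    public static int bucketIndex(int hash, int numBuckets) {
        return (hash & 0x7FFFFFFF) % numBuckets;
    }

    public static void main(String[] args) {
        int buckets = 10;
        // An int ending in 0 lands at index 0 (bucket 1 in the text).
        System.out.println(bucketIndex(hashInt(10), buckets));
        // A large BIGINT (2^32) does not hash to its own value.
        System.out.println(bucketIndex(hashBigint(4294967296L), buckets));
        // The same digits stored as a STRING land somewhere else entirely.
        System.out.println(bucketIndex(hashString("10"), buckets));
    }
}
```

Under these assumptions, the value 10 as an int and the text "10" as a string fall into different buckets, which is the kind of mismatch that appears when the bucketing column type differs between insert and read. Small BIGINT values happen to hash to themselves under this stand-in; the difference only shows once the high 32 bits are non-zero.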
