[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Apache Wiki Mon, 16 Aug 2010 14:50:56 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=19&rev2=20

--------------------------------------------------

  limit 11;
  }}}
  
- This works by ordering all of the rows in a sample of the table (using a 
single reducer), and then selecting every nth row (here n=910000).  The value 
of n is chosen by dividing the total number of rows in the table by the desired 
number of ranges, e.g. 12 in this case (one more than the number of 
partitioning keys produced by the LIMIT clause).  The assumption here is that 
the distribution in the sample matches the overall distribution in the table; 
if this is not the case, the resulting partition keys will lead to skew in the 
parallel sort.
+ This works by ordering all of the rows in a .01% sample of the table (using a 
single reducer), and then selecting every nth row (here n=910000).  The value 
of n is chosen by dividing the total number of rows in the sample by the 
desired number of ranges, e.g. 12 in this case (one more than the number of 
partitioning keys produced by the LIMIT clause).  The assumption here is that 
the distribution in the sample matches the overall distribution in the table; 
if this is not the case, the resulting partition keys will lead to skew in the 
parallel sort.
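
The selection of partitioning keys described above can be illustrated outside of Hive. The following is a minimal sketch in Python, not the query from the wiki page: the function name `pick_partition_keys` and the synthetic row keys are hypothetical, and the sample is assumed to fit in memory. It sorts the sampled keys, computes n as the number of sampled rows divided by the desired number of ranges, and keeps every nth key, stopping one short of the number of ranges (the role played by the LIMIT clause).

```python
def pick_partition_keys(sample_keys, num_ranges):
    """Return num_ranges - 1 split keys chosen from a sample of row keys."""
    # Global ordering of the sample (what the single reducer does).
    ordered = sorted(sample_keys)
    # Stride n: rows in the sample divided by the desired number of ranges.
    n = len(ordered) // num_ranges
    if n == 0:
        raise ValueError("sample is smaller than the number of ranges")
    # Keep every nth row, but only the first num_ranges - 1 of them,
    # since k split keys define k + 1 ranges.
    return [ordered[i * n - 1] for i in range(1, num_ranges)]


# Hypothetical sample of 1200 zero-padded row keys, split into 12 ranges.
keys = [f"row{i:04d}" for i in range(1200)]
splits = pick_partition_keys(keys, 12)
print(len(splits), splits[0], splits[-1])
```

If the sample is not representative of the full table, the keys returned this way will produce ranges of uneven size, which is the skew the paragraph above warns about.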
  
  Once you have your sampling query defined, the next step is to save its 
results to a properly formatted file which will be used in a subsequent step.  
To do this, run commands like the following:
