Sampling" by RaghothamMurthy

Apache Wiki Wed, 21 Jan 2009 15:26:52 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Sampling

New page:
Syntax:
{{{
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
}}}

The TABLESAMPLE clause allows the users to write queries for samples of the 
data instead of the whole table. The TABLESAMPLE clause can be added to any 
table in the FROM clause. The buckets are numbered starting from 0. 
'''colname''' indicates the column on which to sample each row in the table. 
colname can be one of the non-partition columns in the table or '''rand()''' 
indicating sampling on the entire row instead of an individual column. The rows 
of the table are 'bucketed' on the colname randomly into y buckets numbered 0 
through y. Rows which belong to bucket x are returned.  

In the following example the 3rd bucket out of the 32 buckets of the table 
source. 's' is the table alias.
{{{
SELECT * 
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; 
}}}

'''Input pruning''': Typically, TABLESAMPLE will scan the entire table and 
fetch the sample. But, that is not very efficient. Instead, the table can be  
created with a CLUSTERED BY clause which indicates the set of columns on which 
the table is hash-partitioned/clustered on. If the columns specified in the 
TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE 
scans only the required hash-partitions of the table.

Example:

So in the above example, if table 'source' was created with 'CLUSTERED BY id 
INTO 32 BUCKETS' 
{{{
    TABLESAMPLE(BUCKET 3 OUT OF 16) 
}}}
would pick out the 3rd and 19th buckets. 

On the other hand the tablesample clause
{{{
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id) 
}}}
would pick out half of the 3rd bucket.

[Hadoop Wiki] Trivial Update of "Hive/LanguageManual/Sampling" by RaghothamMurthy

Reply via email to