[jira] Updated: (HBASE-2571) Coprocessors: Minitables

Andrew Purtell (JIRA) Wed, 19 May 2010 10:04:15 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrew Purtell updated HBASE-2571:
----------------------------------

    Description: 
>From 
>http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf :

{quote}
MINITABLES: SAMPLING BIGTABLE
Alberto Lerner and S. Muthukrishnan

[...] Because of [BigTable] semantics and storing scheme, skipping N rows is 
not feasible without actually reading them. Even finding the count of rows in a 
Bigtable at any point in time can be done only probabilistically. On the bright 
side, since Bigtable does not provide a relational query engine, we do not need 
to consider what are suitable sampling methods for various relational operators 
(like joins) or take into account how sampling errors compound with increasing 
levels of query composition. 

_Uniform Random Sampling_

Our sampling scheme extracts and presents a sample of a Bigtable's rows as if 
it were a Bigtable itself, which we call a Minitable. The rationale here is 
that code written to run against a Bigtable can run unchanged against a sample 
thereof. Our sampling is based on a hash scheme. We pick a convenient hash 
function that maps the rowname space into a very large keyspace (e.g., a ax+b 
mod p function, where p is as large as 2128). The rows falling into the first 
fp keys where f is the relative sample size (it is a fraction), would belong in 
the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0, 
fp−1], then add r to the sample. It is easy to see that the expected size of 
the sample is f * 100% of the Bigtable rowcount independent of the rowcount, 
and the probability that a particular row r is in the sample is f, as desired. 
This hash-based sampling method supports maintenance of the sample with each 
Bigtable mutation (insert, update, or deletion). Only the system may forward 
relevant mutations from the Bigtable to the Minitable. Otherwise, the latter 
would behave as just any other Bigtable: it could be backed up and even be 
replicated. We are currently deploying Minitables in the repository of 
documents that the crawling system generates. Several Minitables, each with a 
different sample factor, allow that system to compute aggregates much faster 
and more often.

_Biased Sampling_

Uniform random sampling is quite useful but some scenarios require biased 
sampling methods. We are currently working on one such extension that we call 
Mask Sampling. In this scheme, the decision to select a row to the sample is 
still based on its rowname but now a user may specify a mask m over it. The 
mask, which can be a regular expression that matches portions of a rowname, is 
used to group rows together. Two rows belong to a same group if their masks 
result in the same string. Mask sampling guarantees that if a group is selected 
to the sample, that group will be adequately represented there, regardless of 
that group's relative size.
{quote}

Clearly minitables can be constructed on the fly by a coprocessor attached to 
the source table.

> Coprocessors: Minitables
> ------------------------
>
>                 Key: HBASE-2571
>                 URL: https://issues.apache.org/jira/browse/HBASE-2571
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>
> From 
> http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf 
> :
> {quote}
> MINITABLES: SAMPLING BIGTABLE
> Alberto Lerner and S. Muthukrishnan
> [...] Because of [BigTable] semantics and storing scheme, skipping N rows is 
> not feasible without actually reading them. Even finding the count of rows in 
> a Bigtable at any point in time can be done only probabilistically. On the 
> bright side, since Bigtable does not provide a relational query engine, we do 
> not need to consider what are suitable sampling methods for various 
> relational operators (like joins) or take into account how sampling errors 
> compound with increasing levels of query composition. 
> _Uniform Random Sampling_
> Our sampling scheme extracts and presents a sample of a Bigtable's rows as if 
> it were a Bigtable itself, which we call a Minitable. The rationale here is 
> that code written to run against a Bigtable can run unchanged against a 
> sample thereof. Our sampling is based on a hash scheme. We pick a convenient 
> hash function that maps the rowname space into a very large keyspace (e.g., a 
> ax+b mod p function, where p is as large as 2128). The rows falling into the 
> first fp keys where f is the relative sample size (it is a fraction), would 
> belong in the sample. Formally, we pick a hash function h : R −> 0..p and if 
> h(r) E [0, fp−1], then add r to the sample. It is easy to see that the 
> expected size of the sample is f * 100% of the Bigtable rowcount independent 
> of the rowcount, and the probability that a particular row r is in the sample 
> is f, as desired. This hash-based sampling method supports maintenance of the 
> sample with each Bigtable mutation (insert, update, or deletion). Only the 
> system may forward relevant mutations from the Bigtable to the Minitable. 
> Otherwise, the latter would behave as just any other Bigtable: it could be 
> backed up and even be replicated. We are currently deploying Minitables in 
> the repository of documents that the crawling system generates. Several 
> Minitables, each with a different sample factor, allow that system to compute 
> aggregates much faster and more often.
> _Biased Sampling_
> Uniform random sampling is quite useful but some scenarios require biased 
> sampling methods. We are currently working on one such extension that we call 
> Mask Sampling. In this scheme, the decision to select a row to the sample is 
> still based on its rowname but now a user may specify a mask m over it. The 
> mask, which can be a regular expression that matches portions of a rowname, 
> is used to group rows together. Two rows belong to a same group if their 
> masks result in the same string. Mask sampling guarantees that if a group is 
> selected to the sample, that group will be adequately represented there, 
> regardless of that group's relative size.
> {quote}
> Clearly minitables can be constructed on the fly by a coprocessor attached to 
> the source table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-2571) Coprocessors: Minitables

Reply via email to