[
https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell updated HBASE-2571:
----------------------------------
Description:
>From
>http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf :
{quote}
MINITABLES: SAMPLING BIGTABLE
Alberto Lerner and S. Muthukrishnan
[...] Because of [BigTable] semantics and storing scheme, skipping N rows is
not feasible without actually reading them. Even finding the count of rows in a
Bigtable at any point in time can be done only probabilistically. On the bright
side, since Bigtable does not provide a relational query engine, we do not need
to consider what are suitable sampling methods for various relational operators
(like joins) or take into account how sampling errors compound with increasing
levels of query composition.
_Uniform Random Sampling_
Our sampling scheme extracts and presents a sample of a Bigtable's rows as if
it were a Bigtable itself, which we call a Minitable. The rationale here is
that code written to run against a Bigtable can run unchanged against a sample
thereof. Our sampling is based on a hash scheme. We pick a convenient hash
function that maps the rowname space into a very large keyspace (e.g., a ax+b
mod p function, where p is as large as 2128). The rows falling into the first
fp keys where f is the relative sample size (it is a fraction), would belong in
the sample. Formally, we pick a hash function h : R −> 0..p and if h(r) E [0,
fp−1], then add r to the sample. It is easy to see that the expected size of
the sample is f * 100% of the Bigtable rowcount independent of the rowcount,
and the probability that a particular row r is in the sample is f, as desired.
This hash-based sampling method supports maintenance of the sample with each
Bigtable mutation (insert, update, or deletion). Only the system may forward
relevant mutations from the Bigtable to the Minitable. Otherwise, the latter
would behave as just any other Bigtable: it could be backed up and even be
replicated. We are currently deploying Minitables in the repository of
documents that the crawling system generates. Several Minitables, each with a
different sample factor, allow that system to compute aggregates much faster
and more often.
_Biased Sampling_
Uniform random sampling is quite useful but some scenarios require biased
sampling methods. We are currently working on one such extension that we call
Mask Sampling. In this scheme, the decision to select a row to the sample is
still based on its rowname but now a user may specify a mask m over it. The
mask, which can be a regular expression that matches portions of a rowname, is
used to group rows together. Two rows belong to a same group if their masks
result in the same string. Mask sampling guarantees that if a group is selected
to the sample, that group will be adequately represented there, regardless of
that group's relative size.
{quote}
Clearly minitables can be constructed on the fly by a coprocessor attached to
the source table.
> Coprocessors: Minitables
> ------------------------
>
> Key: HBASE-2571
> URL: https://issues.apache.org/jira/browse/HBASE-2571
> Project: Hadoop HBase
> Issue Type: Sub-task
> Reporter: Andrew Purtell
>
> From
> http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf
> :
> {quote}
> MINITABLES: SAMPLING BIGTABLE
> Alberto Lerner and S. Muthukrishnan
> [...] Because of [BigTable] semantics and storing scheme, skipping N rows is
> not feasible without actually reading them. Even finding the count of rows in
> a Bigtable at any point in time can be done only probabilistically. On the
> bright side, since Bigtable does not provide a relational query engine, we do
> not need to consider what are suitable sampling methods for various
> relational operators (like joins) or take into account how sampling errors
> compound with increasing levels of query composition.
> _Uniform Random Sampling_
> Our sampling scheme extracts and presents a sample of a Bigtable's rows as if
> it were a Bigtable itself, which we call a Minitable. The rationale here is
> that code written to run against a Bigtable can run unchanged against a
> sample thereof. Our sampling is based on a hash scheme. We pick a convenient
> hash function that maps the rowname space into a very large keyspace (e.g., a
> ax+b mod p function, where p is as large as 2128). The rows falling into the
> first fp keys where f is the relative sample size (it is a fraction), would
> belong in the sample. Formally, we pick a hash function h : R −> 0..p and if
> h(r) E [0, fp−1], then add r to the sample. It is easy to see that the
> expected size of the sample is f * 100% of the Bigtable rowcount independent
> of the rowcount, and the probability that a particular row r is in the sample
> is f, as desired. This hash-based sampling method supports maintenance of the
> sample with each Bigtable mutation (insert, update, or deletion). Only the
> system may forward relevant mutations from the Bigtable to the Minitable.
> Otherwise, the latter would behave as just any other Bigtable: it could be
> backed up and even be replicated. We are currently deploying Minitables in
> the repository of documents that the crawling system generates. Several
> Minitables, each with a different sample factor, allow that system to compute
> aggregates much faster and more often.
> _Biased Sampling_
> Uniform random sampling is quite useful but some scenarios require biased
> sampling methods. We are currently working on one such extension that we call
> Mask Sampling. In this scheme, the decision to select a row to the sample is
> still based on its rowname but now a user may specify a mask m over it. The
> mask, which can be a regular expression that matches portions of a rowname,
> is used to group rows together. Two rows belong to a same group if their
> masks result in the same string. Mask sampling guarantees that if a group is
> selected to the sample, that group will be adequately represented there,
> regardless of that group's relative size.
> {quote}
> Clearly minitables can be constructed on the fly by a coprocessor attached to
> the source table.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.