[
https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell updated HBASE-2571:
----------------------------------
Issue Type: New Feature (was: Sub-task)
Parent: (was: HBASE-2000)
> Coprocessors: Minitables
> ------------------------
>
> Key: HBASE-2571
> URL: https://issues.apache.org/jira/browse/HBASE-2571
> Project: HBase
> Issue Type: New Feature
> Components: coprocessors
> Reporter: Andrew Purtell
>
> From
> http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf
> :
> {quote}
> MINITABLES: SAMPLING BIGTABLE
> Alberto Lerner and S. Muthukrishnan
> [...] Because of [BigTable] semantics and storing scheme, skipping N rows is
> not feasible without actually reading them. Even finding the count of rows in
> a Bigtable at any point in time can be done only probabilistically. On the
> bright side, since Bigtable does not provide a relational query engine, we do
> not need to consider what are suitable sampling methods for various
> relational operators (like joins) or take into account how sampling errors
> compound with increasing levels of query composition.
> _Uniform Random Sampling_
> Our sampling scheme extracts and presents a sample of a Bigtable's rows as if
> it were a Bigtable itself, which we call a Minitable. The rationale here is
> that code written to run against a Bigtable can run unchanged against a
> sample thereof. Our sampling is based on a hash scheme. We pick a convenient
> hash function that maps the rowname space into a very large keyspace (e.g., a
> ax+b mod p function, where p is as large as 2128). The rows falling into the
> first fp keys where f is the relative sample size (it is a fraction), would
> belong in the sample. Formally, we pick a hash function h : R −> 0..p and if
> h(r) E [0, fp−1], then add r to the sample. It is easy to see that the
> expected size of the sample is f * 100% of the Bigtable rowcount independent
> of the rowcount, and the probability that a particular row r is in the sample
> is f, as desired. This hash-based sampling method supports maintenance of the
> sample with each Bigtable mutation (insert, update, or deletion). Only the
> system may forward relevant mutations from the Bigtable to the Minitable.
> Otherwise, the latter would behave as just any other Bigtable: it could be
> backed up and even be replicated. We are currently deploying Minitables in
> the repository of documents that the crawling system generates. Several
> Minitables, each with a different sample factor, allow that system to compute
> aggregates much faster and more often.
> _Biased Sampling_
> Uniform random sampling is quite useful but some scenarios require biased
> sampling methods. We are currently working on one such extension that we call
> Mask Sampling. In this scheme, the decision to select a row to the sample is
> still based on its rowname but now a user may specify a mask m over it. The
> mask, which can be a regular expression that matches portions of a rowname,
> is used to group rows together. Two rows belong to a same group if their
> masks result in the same string. Mask sampling guarantees that if a group is
> selected to the sample, that group will be adequately represented there,
> regardless of that group's relative size.
> {quote}
> Clearly minitables can be constructed on the fly by a coprocessor attached to
> the source table.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira