[
https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christopher Tubbs updated ACCUMULO-3913:
----------------------------------------
Issue Type: Improvement (was: Bug)
> Add per table sampling
> ----------------------
>
> Key: ACCUMULO-3913
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Keith Turner
> Fix For: 1.8.0
>
>
> I am prototyping hash-based sampling for Accumulo. I am trying to accomplish
> the following goals in the prototype.
> # Have each RFile store a sample per locality group. Also store the
> configuration used to generate the sample.
> # Use sampling functions that ensure the same row columns exist across the
> samples in all RFiles. Hash mod is a good candidate that gives a random
> sample that's consistent across files.
> # Have scanners support scanning RFiles' sample sets. A scan should fail if
> the RFiles have different sample configurations. Different sampling config
> implies that the RFiles' sample sets contain possibly disjoint sets of row
> columns.
> # Support generating sample data for RFiles generated for bulk import.
> # Support sample data in the in-memory map.
> # Support enabling and disabling sampling per table and configuring a
> sample function.
> I am currently using the following function in my prototype to determine what
> data an RFile stores in its sample set. This code will always select the same
> subset of rows for each RFile's sample set. I have not yet made the function
> configurable.
> {code:java}
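> import com.google.common.hash.HashCode;
> import com.google.common.hash.HashFunction;
> import com.google.common.hash.Hashing;
> import org.apache.accumulo.core.data.ByteSequence;
> import org.apache.accumulo.core.data.Key;
>
> // Sampler is the prototype's interface; it is not shown here.
> public class RowSampler implements Sampler {
>   private HashFunction hasher = Hashing.murmur3_32();
>
>   // Accept a key when the hash of its row is a multiple of 1009.
>   @Override
>   public boolean accept(Key k) {
>     ByteSequence row = k.getRowData();
>     HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(),
>         row.length());
>     return hc.asInt() % 1009 == 0;
>   }
> }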
> {code}
> Although not yet implemented, the divisor in this RowSampler could be
> configurable. RFiles with sample data would store the fact that a RowSampler
> with a divisor of 1009 was used to generate sample data.
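As an illustration of the configurable divisor and the stored sample configuration described above, here is a minimal standalone sketch. It is not the prototype's code: the class name HashModSampler, the config string format, and the requireSameConfig helper are all hypothetical, and CRC32 from the JDK stands in for the murmur3 hash so the example needs no Guava or Accumulo dependency.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical sketch of a hash-mod row sampler with a configurable divisor. */
final class HashModSampler {
    private final int divisor;

    HashModSampler(int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        this.divisor = divisor;
    }

    /** Assumed format for the configuration a file would record with its sample. */
    String config() {
        return "hashmod:crc32:" + divisor;
    }

    /** Accept a row when its hash is a multiple of the divisor. */
    boolean accept(byte[] row) {
        CRC32 crc = new CRC32(); // stand-in for murmur3_32; the value is non-negative
        crc.update(row, 0, row.length);
        return crc.getValue() % divisor == 0;
    }

    /** Return the rows that fall into the sample, in input order. */
    List<String> sample(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String r : rows) {
            if (accept(r.getBytes(StandardCharsets.UTF_8))) {
                out.add(r);
            }
        }
        return out;
    }

    /** Mimics the "scan should fail" rule: samples built with different configs do not mix. */
    static void requireSameConfig(HashModSampler a, HashModSampler b) {
        if (!a.config().equals(b.config())) {
            throw new IllegalStateException(
                "sample configs differ: " + a.config() + " vs " + b.config());
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            rows.add("row" + i);
        }
        HashModSampler sampler = new HashModSampler(1009);
        System.out.println(sampler.config() + " sampled " + sampler.sample(rows).size()
            + " of " + rows.size() + " rows");
    }
}
```

Because accept depends only on the row bytes and the divisor, two files sampled with the same configuration agree on every row they share, which is the consistency property the hash mod approach is meant to provide.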
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)