[ https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744741#comment-14744741 ]
ASF GitHub Bot commented on ACCUMULO-3913:
------------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/46#discussion_r39470284

    --- Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java ---
    @@ -54,10 +59,11 @@ public TabletIteratorEnvironment(IteratorScope scope, AccumuloConfiguration conf
         this.config = config;
         this.fullMajorCompaction = false;
         this.authorizations = Authorizations.EMPTY;
    +    this.topLevelIterators = new ArrayList<SortedKeyValueIterator<Key,Value>>();
    --- End diff --

    Unnecessary SKVI<K,V> parameterization?


> Add per table sampling
> ----------------------
>
>                 Key: ACCUMULO-3913
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> I am working on prototyping adding hash based sampling to Accumulo. I am
> trying to accomplish the following goals in the prototype.
>
> # Have each RFile store a sample per locality group. Also store the
> configuration used to generate the sample.
> # Use sampling functions that ensure the same row columns exist across the
> samples in all RFiles. Hash mod is a good candidate that gives a random
> sample that's consistent across files.
> # Have scanners support scanning RFiles' sample sets. Scan should fail if
> RFiles have different sample configuration. Different sampling config
> implies the RFiles' sample sets contain a possibly disjoint set of row
> columns.
> # Support generating sample data for RFiles generated for bulk import
> # Support sample data in the memory map
> # Support enabling and disabling sampling per table AND configuring a
> sample function.
>
> I am currently using the following function in my prototype to determine what
> data an RFile stores in its sample set. This code will always select the same
> subset of rows for each RFile's sample set. I have not yet made the function
> configurable.
>
> {code:java}
> public class RowSampler implements Sampler {
>   private HashFunction hasher = Hashing.murmur3_32();
>
>   @Override
>   public boolean accept(Key k) {
>     ByteSequence row = k.getRowData();
>     HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(),
>         row.length());
>     return hc.asInt() % 1009 == 0;
>   }
> }
> {code}
>
> Although not yet implemented, the divisor in this RowSampler could be
> configurable. RFiles with sample data would store the fact that a RowSampler
> with a divisor of 1009 was used to generate sample data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
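
The configurable divisor mentioned at the end of the quoted issue could look
roughly like the sketch below. This is an illustration only, not code from the
prototype or the pull request: HashModRowSampler, divisor() and sameConfigAs()
are hypothetical names, the class does not implement Accumulo's Sampler
interface, and java.util.zip.CRC32 stands in for the Guava murmur3_32 hash so
that only the JDK is needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/**
 * Sketch of hash-mod row sampling with a configurable divisor.
 * Hypothetical stand-in: not Accumulo's Sampler interface, and CRC32 replaces
 * the murmur3_32 hash used in the issue.
 */
final class HashModRowSampler {
  private final int divisor;

  HashModRowSampler(int divisor) {
    if (divisor <= 0) {
      throw new IllegalArgumentException("divisor must be positive");
    }
    this.divisor = divisor;
  }

  /** The configuration a file would record alongside its sample data. */
  int divisor() {
    return divisor;
  }

  /** Accept a row when its hash is a multiple of the divisor. */
  boolean accept(byte[] row) {
    CRC32 crc = new CRC32();
    crc.update(row, 0, row.length);
    // getValue() holds an unsigned 32-bit value in a long, so the remainder
    // is never negative.
    return crc.getValue() % divisor == 0;
  }

  /** Sample sets are only comparable when built with the same configuration. */
  boolean sameConfigAs(HashModRowSampler other) {
    return divisor == other.divisor;
  }

  public static void main(String[] args) {
    // Two independent samplers with the same divisor model two files.
    HashModRowSampler fileA = new HashModRowSampler(1009);
    HashModRowSampler fileB = new HashModRowSampler(1009);
    int sampled = 0;
    for (int i = 0; i < 100_000; i++) {
      byte[] row = ("row" + i).getBytes(StandardCharsets.UTF_8);
      boolean inA = fileA.accept(row);
      if (inA != fileB.accept(row)) {
        throw new AssertionError("samplers with equal config disagreed");
      }
      if (inA) {
        sampled++;
      }
    }
    System.out.println("sampled " + sampled + " of 100000 rows");
  }
}
```

Because the decision depends only on the row bytes and the divisor, every file
sampled with the same divisor keeps the same rows, which is the consistency
property goal 2 asks for. A check like sameConfigAs() is one way a scan could
detect the mismatched configurations that goal 3 says should make it fail.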