[ https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737316#comment-14737316 ]
ASF GitHub Bot commented on ACCUMULO-3913:
------------------------------------------

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/46#discussion_r39076125

    --- Diff: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java ---
    @@ -297,4 +345,50 @@ public void setReadaheadThreshold(long batches) {
         }
         this.readaheadThreshold = batches;
       }
    +
    +  private SamplerConfiguration getIteratorSamplerConfigurationInternal() {
    +    SamplerConfiguration scannerSamplerConfig = getSamplerConfiguration();
    +    if (scannerSamplerConfig != null) {
    +      if (iteratorSamplerConfig != null && !new SamplerConfigurationImpl(iteratorSamplerConfig).equals(new SamplerConfigurationImpl(scannerSamplerConfig))) {
    --- End diff --

    Fixed in f6f60ad. At one point `SamplerConfiguration` did not have an `equals()` implementation, but now it does, so I shortened it.


> Add per table sampling
> ----------------------
>
>                 Key: ACCUMULO-3913
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> I am working on prototyping hash-based sampling for Accumulo. I am
> trying to accomplish the following goals in the prototype.
> # Have each RFile store a sample per locality group. Also store the
> configuration used to generate the sample.
> # Use sampling functions that ensure the same row columns exist across the
> samples in all RFiles. Hash mod is a good candidate that gives a random
> sample that's consistent across files.
> # Have scanners support scanning RFiles' sample sets. A scan should fail if
> RFiles have different sample configurations. A different sampling config
> implies the RFiles' sample sets contain a possibly disjoint set of row
> columns.
> # Support generating sample data for RFiles generated for bulk import
> # Support sample data in the memory map
> # Support enabling and disabling sampling per table AND configuring a
> sample function.
> I am currently using the following function in my prototype to determine what
> data an RFile stores in its sample set. This code will always select the same
> subset of rows for each RFile's sample set. I have not yet made the function
> configurable.
> {code:java}
> public class RowSampler implements Sampler {
>   private HashFunction hasher = Hashing.murmur3_32();
>   @Override
>   public boolean accept(Key k) {
>     ByteSequence row = k.getRowData();
>     HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(),
>         row.length());
>     return hc.asInt() % 1009 == 0;
>   }
> }
> {code}
> Although not yet implemented, the divisor in this RowSampler could be
> configurable. RFiles with sample data would store the fact that a RowSampler
> with a divisor of 1009 was used to generate sample data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
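
The hash-mod idea in the quoted issue description (the same rows land in every file's sample, with a divisor that could be made configurable) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not Accumulo code: the class name `HashModSampler` is hypothetical, plain strings stand in for `Key` row data, and `java.util.zip.CRC32` replaces Guava's `murmur3_32` so that the example runs with the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical stand-in for the RowSampler in the issue: the same hash-mod
// technique, but with a configurable divisor and a JDK hash (CRC32) instead
// of Guava's murmur3_32.
class HashModSampler {
  private final int divisor;

  HashModSampler(int divisor) {
    if (divisor <= 0) {
      throw new IllegalArgumentException("divisor must be positive");
    }
    this.divisor = divisor;
  }

  // The divisor is the piece of configuration a file would need to record.
  int getDivisor() {
    return divisor;
  }

  // The decision depends only on the row bytes and the divisor, never on
  // which file the row happens to be stored in.
  boolean accept(byte[] row) {
    CRC32 crc = new CRC32();
    crc.update(row, 0, row.length);
    return crc.getValue() % divisor == 0; // getValue() is a non-negative long
  }

  // Returns the rows of one simulated file that belong in its sample set.
  List<String> sample(List<String> rows) {
    List<String> out = new ArrayList<>();
    for (String r : rows) {
      if (accept(r.getBytes(StandardCharsets.UTF_8))) {
        out.add(r);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    HashModSampler sampler = new HashModSampler(3);

    // Two simulated files that overlap on row15..row29.
    List<String> fileA = new ArrayList<>();
    List<String> fileB = new ArrayList<>();
    for (int i = 0; i < 30; i++) fileA.add("row" + i);
    for (int i = 15; i < 45; i++) fileB.add("row" + i);

    List<String> sampleA = sampler.sample(fileA);
    List<String> sampleB = sampler.sample(fileB);

    // A row stored in both files is either in both samples or in neither.
    boolean consistent = true;
    for (int i = 15; i < 30; i++) {
      String row = "row" + i;
      if (sampleA.contains(row) != sampleB.contains(row)) {
        consistent = false;
      }
    }
    System.out.println("divisor=" + sampler.getDivisor()
        + " sampleA=" + sampleA.size()
        + " sampleB=" + sampleB.size()
        + " consistent=" + consistent);
  }
}
```

Two samplers built with different divisors would generally select different row sets, which is the reason the issue proposes failing a scan when files carry different sample configurations.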