[ https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737323#comment-14737323 ]
ASF GitHub Bot commented on ACCUMULO-3913:
------------------------------------------

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/46#discussion_r39076343

    --- Diff: docs/src/main/resources/examples/README.sample ---
    @@ -0,0 +1,188 @@
    +Title: Apache Accumulo Batch Writing and Scanning Example
    +Notice:    Licensed to the Apache Software Foundation (ASF) under one
    +           or more contributor license agreements.  See the NOTICE file
    +           distributed with this work for additional information
    +           regarding copyright ownership.  The ASF licenses this file
    +           to you under the Apache License, Version 2.0 (the
    +           "License"); you may not use this file except in compliance
    +           with the License.  You may obtain a copy of the License at
    +           .
    +             http://www.apache.org/licenses/LICENSE-2.0
    +           .
    +           Unless required by applicable law or agreed to in writing,
    +           software distributed under the License is distributed on an
    +           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +           KIND, either express or implied.  See the License for the
    +           specific language governing permissions and limitations
    +           under the License.
    +
    +
    +Basic Sampling Example
    --- End diff --

    I'll do that

> Add per table sampling
> ----------------------
>
>                 Key: ACCUMULO-3913
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> I am working on prototyping adding hash based sampling to Accumulo. I am
> trying to accomplish the following goals in the prototype.
> # Have each RFile store a sample per locality group. Also store the
> configuration used to generate the sample.
> # Use sampling functions that ensure the same row columns exist across the
> samples in all RFiles. Hash mod is a good candidate that gives a random
> sample that's consistent across files.
> # Have scanners support scanning RFiles' sample sets. Scan should fail if
> RFiles have different sample configuration. Different sampling config
> implies the RFiles' sample sets contain a possibly disjoint set of row
> columns.
> # Support generating sample data for RFiles generated for bulk import
> # Support sample data in the memory map
> # Support enabling and disabling sampling per table AND configuring a
> sample function.
> I am currently using the following function in my prototype to determine what
> data an RFile stores in its sample set. This code will always select the same
> subset of rows for each RFile's sample set. I have not yet made the function
> configurable.
> {code:java}
> public class RowSampler implements Sampler {
>     private HashFunction hasher = Hashing.murmur3_32();
>
>     @Override
>     public boolean accept(Key k) {
>         ByteSequence row = k.getRowData();
>         HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(),
>             row.length());
>         return hc.asInt() % 1009 == 0;
>     }
> }
> {code}
> Although not yet implemented, the divisor in this RowSampler could be
> configurable. RFiles with sample data would store the fact that a RowSampler
> with a divisor of 1009 was used to generate sample data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
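
As an illustration of the "configurable divisor" idea in the issue description, here is a minimal standalone sketch. It is not the prototype's code and not Accumulo API: the class name DivisorRowSampler and its methods are hypothetical, it operates on raw row bytes instead of Key, and it uses java.util.zip.CRC32 from the JDK as a stand-in for Guava's murmur3_32 so that it runs without extra dependencies. It only demonstrates why hash mod gives a sample that is consistent across files: whether a row is accepted depends on the row bytes and the divisor alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical row sampler with a configurable divisor (illustration only). */
final class DivisorRowSampler {
    private final int divisor;

    DivisorRowSampler(int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        this.divisor = divisor;
    }

    /** The divisor would be stored with the sample so scans can compare configs. */
    int divisor() {
        return divisor;
    }

    /** Accepts a row when the hash of its bytes is a multiple of the divisor. */
    boolean accept(byte[] row) {
        CRC32 crc = new CRC32(); // assumption: stand-in for murmur3_32
        crc.update(row, 0, row.length);
        return crc.getValue() % divisor == 0; // getValue() is a non-negative long
    }

    /** Returns the rows of one simulated file that belong to the sample. */
    List<String> sample(List<String> rows) {
        List<String> sampled = new ArrayList<>();
        for (String row : rows) {
            if (accept(row.getBytes(StandardCharsets.UTF_8))) {
                sampled.add(row);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        DivisorRowSampler sampler = new DivisorRowSampler(7);
        List<String> fileA = new ArrayList<>();
        List<String> fileB = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String row = String.format("row_%05d", i);
            fileA.add(row);
            if (i % 2 == 0) {
                fileB.add(row); // fileB holds a subset of fileA's rows
            }
        }
        List<String> sampleA = sampler.sample(fileA);
        List<String> sampleB = sampler.sample(fileB);
        // A row sampled in one file is sampled in every file that contains it.
        System.out.println("sampleA=" + sampleA.size()
            + " sampleB=" + sampleB.size()
            + " consistent=" + sampleA.containsAll(sampleB));
    }
}
```

Two samplers built with different divisors would in general accept different rows, which is why the description has scans fail when RFiles carry different sample configurations.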