[ https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732124#comment-14732124 ]
ASF GitHub Bot commented on ACCUMULO-3913:
------------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/46#discussion_r38815377

    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SamplerConfiguration.java ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import static com.google.common.base.Preconditions.checkArgument;
    +
    +import java.util.Collections;
    +import java.util.HashMap;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +
    +import com.google.common.base.Preconditions;
    +
    +/**
    + * @since 1.8.0
    + */
    +
    +public class SamplerConfiguration {
    --- End diff --

    Needs javadoc


> Add per table sampling
> ----------------------
>
>                 Key: ACCUMULO-3913
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> I am working on prototyping adding hash based sampling to Accumulo. I am
> trying to accomplish the following goals in the prototype.
> # Have each RFile store a sample per locality group. Also store the
> configuration used to generate the sample.
> # Use sampling functions that ensure the same row columns exist across the
> samples in all RFiles. Hash mod is a good candidate that gives a random
> sample that's consistent across files.
> # Have scanners support scanning RFiles' sample sets. A scan should fail if
> RFiles have different sample configurations. Different sampling config
> implies the RFiles' sample sets contain a possibly disjoint set of row
> columns.
> # Support generating sample data for RFiles generated for bulk import.
> # Support sample data in the memory map.
> # Support enabling and disabling sampling per table AND configuring a
> sample function.
> I am currently using the following function in my prototype to determine what
> data an RFile stores in its sample set. This code will always select the same
> subset of rows for each RFile's sample set. I have not yet made the function
> configurable.
> {code:java}
> import org.apache.accumulo.core.data.ByteSequence;
> import org.apache.accumulo.core.data.Key;
> import com.google.common.hash.HashCode;
> import com.google.common.hash.HashFunction;
> import com.google.common.hash.Hashing;
>
> // Sampler is the prototype's own interface (not shown here).
> public class RowSampler implements Sampler {
>   private HashFunction hasher = Hashing.murmur3_32();
>
>   @Override
>   public boolean accept(Key k) {
>     // Hash only the row, so every key in a given row gets the same decision.
>     ByteSequence row = k.getRowData();
>     HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(),
>         row.length());
>     // Keep roughly 1 in 1009 rows.
>     return hc.asInt() % 1009 == 0;
>   }
> }
> {code}
> Although not yet implemented, the divisor in this RowSampler could be
> configurable. RFiles with sample data would store the fact that a RowSampler
> with a divisor of 1009 was used to generate sample data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
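
A rough sketch of the configurable-divisor idea from the issue description, written in plain Java so it runs without Accumulo or Guava on the classpath. The names ModuloRowSampler and compatibleWith, and the use of CRC32 in place of murmur3_32, are assumptions made for illustration only; none of them come from the prototype or the pull request.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical stand-in for the prototype's RowSampler, with a configurable divisor. */
final class ModuloRowSampler {
  private final int modulus;

  ModuloRowSampler(int modulus) {
    if (modulus <= 0) {
      throw new IllegalArgumentException("modulus must be positive");
    }
    this.modulus = modulus;
  }

  int getModulus() {
    return modulus;
  }

  /** Accepts a row when its hash is divisible by the modulus. */
  boolean accept(byte[] row) {
    // Assumption: CRC32 replaces murmur3_32 only to keep this sketch JDK-only.
    CRC32 crc = new CRC32();
    crc.update(row, 0, row.length);
    // getValue() is a non-negative long, so the remainder is never negative.
    return crc.getValue() % modulus == 0;
  }

  /** Two sample sets are comparable only if they were built with the same divisor. */
  boolean compatibleWith(ModuloRowSampler other) {
    return modulus == other.modulus;
  }

  public static void main(String[] args) {
    // Two independent samplers, standing in for two RFiles with the same config.
    ModuloRowSampler fileA = new ModuloRowSampler(1009);
    ModuloRowSampler fileB = new ModuloRowSampler(1009);
    int sampled = 0;
    int disagreements = 0;
    for (int i = 0; i < 100_000; i++) {
      byte[] row = String.format("row_%06d", i).getBytes(StandardCharsets.UTF_8);
      boolean a = fileA.accept(row);
      if (a) {
        sampled++;
      }
      if (a != fileB.accept(row)) {
        disagreements++;
      }
    }
    System.out.println("sampled rows: " + sampled);
    System.out.println("disagreements between files: " + disagreements);
  }
}
```

Because the decision depends only on the row bytes and the divisor, two files sampled with the same divisor select the same rows. That is the property the description relies on when it says a scan should fail if RFiles carry different sample configurations.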