[jira] [Commented] (ACCUMULO-3913) Add per table sampling

ASF GitHub Bot (JIRA) Tue, 15 Sep 2015 08:08:30 -0700

    [ https://issues.apache.org/jira/browse/ACCUMULO-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745581#comment-14745581 ]


ASF GitHub Bot commented on ACCUMULO-3913:
------------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/46#discussion_r39521784
  
    --- Diff: core/src/main/thrift/tabletserver.thrift ---
    @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService {
                                       5:map<string, map<string, string>> ssio,
                                       6:list<binary> authorizations,
                                       7:bool waitForWrites,
    -                                  9:i64 batchTimeOut)  throws (1:client.ThriftSecurityException sec),
    -  data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi),
    +                                  9:TSamplerConfiguration samplerConfig,
    --- End diff --
    
    Ah ok. I forgot that.


> Add per table sampling
> ----------------------
>
>                 Key: ACCUMULO-3913
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3913
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> I am working on prototyping adding hash-based sampling to Accumulo.  I am 
> trying to accomplish the following goals in the prototype.
>   # Have each RFile store a sample per locality group.  Also store the 
> configuration used to generate the sample.
>   # Use sampling functions that ensure the same row columns exist across the 
> samples in all RFiles. Hash mod is a good candidate that gives a random 
> sample that's consistent across files.
>   # Have scanners support scanning RFiles' sample sets.  A scan should fail if 
> RFiles have different sample configurations.  Different sampling configs 
> imply that the RFiles' sample sets contain possibly disjoint sets of row 
> columns.
>   # Support generating sample data for RFiles generated for bulk import
>   # Support sample data in the memory map
>   # Support enabling and disabling sampling per table AND configuring a 
> sample function.
> I am currently using the following function in my prototype to determine what 
> data an RFile stores in its sample set.  This code will always select the same 
> subset of rows for each RFile's sample set.  I have not yet made the function 
> configurable.
> {code:java}
> public class RowSampler implements Sampler {
>   private HashFunction hasher = Hashing.murmur3_32();
>   @Override
>   public boolean accept(Key k) {
>     ByteSequence row = k.getRowData();
>     HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(), row.length());
>     return hc.asInt() % 1009 == 0;
>   }
> }
> {code}
> Although not yet implemented, the divisor in this RowSampler could be 
> configurable. RFiles with sample data would store the fact that a RowSampler 
> with a divisor of 1009 was used to generate sample data.
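
To illustrate the hash-mod idea from the description above, here is a minimal, self-contained sketch. It is not the prototype's code: it swaps Guava's murmur3_32 for java.util.zip.CRC32 so that it runs with only the JDK, and the class name HashModSampleSketch, its helper methods, the synthetic row names, and the divisor of 7 are all assumptions made for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.CRC32;

public class HashModSampleSketch {

  // Accept a row when its hash is divisible by the divisor (divisor must be > 0).
  // CRC32 is a stand-in hash for this sketch; the prototype uses murmur3_32.
  public static boolean accept(byte[] row, int divisor) {
    CRC32 crc = new CRC32();
    crc.update(row, 0, row.length);
    // getValue() is a non-negative long, so the remainder is never negative.
    return crc.getValue() % divisor == 0;
  }

  // Build the sample set for one "file", modeled here as a list of row names.
  public static Set<String> sample(List<String> rows, int divisor) {
    Set<String> out = new TreeSet<>();
    for (String r : rows) {
      if (accept(r.getBytes(StandardCharsets.UTF_8), divisor)) {
        out.add(r);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Two synthetic files whose row ranges overlap on row500..row1499.
    List<String> fileA = new ArrayList<>();
    List<String> fileB = new ArrayList<>();
    for (int i = 0; i < 1500; i++) fileA.add("row" + i);
    for (int i = 500; i < 2000; i++) fileB.add("row" + i);

    int divisor = 7; // small divisor so the tiny demo yields a non-empty sample
    Set<String> sampleA = sample(fileA, divisor);
    Set<String> sampleB = sample(fileB, divisor);

    // Restrict each sample to the rows the other file also contains.
    Set<String> aInB = new TreeSet<>(sampleA);
    aInB.retainAll(fileB);
    Set<String> bInA = new TreeSet<>(sampleB);
    bInA.retainAll(fileA);

    System.out.println("sample sizes: " + sampleA.size() + ", " + sampleB.size());
    // The decision depends only on the row bytes and the divisor, so both
    // files agree on every row they share.
    System.out.println("consistent on shared rows: " + aInB.equals(bInA));
  }
}
```

Because the accept decision is a pure function of the row and the divisor, two files sampled with the same divisor agree on every shared row. With different divisors that agreement is no longer guaranteed, which is the case the description says a scan should reject.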



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
