[jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat

ASF GitHub Bot (JIRA) Wed, 08 Apr 2015 13:54:12 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486005#comment-14486005
 ]


ASF GitHub Bot commented on ACCUMULO-3602:
------------------------------------------

Github user keith-turner commented on the pull request:

    https://github.com/apache/accumulo/pull/25#issuecomment-91033820
  
    I was discussing the big picture behind this PR with @ctubbsii.  It seems 
like this change could encourage users to pass many ranges as configuration for 
the MapReduce job.  This could cause memory exhaustion for the job tracker.
    
    We discussed passing a function that generates a set of ranges, instead of 
passing lots of ranges.  The implementation would still use a batch scanner (or 
a scanner with a special iterator, but it's harder to pass code to the 
tserver).  Each input split could call a function like the following, which 
deterministically creates a set of ranges.  Those ranges could then be used 
for the batch scanner.
    
    ```java
    interface RangeGenerator {
      /**
       * @param tabletRange  The data range for the tablet over which the input
       *                     split is executing
       * @param config a mysterious class that allows user to pass parameters
       *               to the function
       */
      List<Range> createRanges(Range tabletRange, Myst config);
    }
    ```
    When configuring the AccumuloInputFormat to use the batch scanner, the name 
of a class that implements this interface would be provided.  The ranges set on 
the job would be large ranges covering the portions of the table to process.  An 
input split would be created for each tablet that falls within those large 
ranges, and for each input split the function would be called to possibly 
create many more ranges.
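
A minimal sketch of how such a generator might look, under stated assumptions. `KeyRange` is a simplified stand-in for Accumulo's `Range` (half-open intervals over `long` keys, as a z-order index might produce), a plain `Map<String, String>` stands in for the unspecified config class, and `StridedRangeGenerator`, `stride`, `width`, and `load` are hypothetical names chosen for illustration. None of this is an existing Accumulo API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for Accumulo's Range (assumption): keys in [start, end). */
final class KeyRange {
  final long start; // inclusive
  final long end;   // exclusive

  KeyRange(long start, long end) {
    this.start = start;
    this.end = end;
  }

  /** Returns the overlap with other, or null if the two ranges do not overlap. */
  KeyRange clip(KeyRange other) {
    long s = Math.max(start, other.start);
    long e = Math.min(end, other.end);
    return s < e ? new KeyRange(s, e) : null;
  }

  @Override
  public String toString() {
    return "[" + start + "," + end + ")";
  }
}

/** Same shape as the proposed interface; a Map stands in for the config class. */
interface RangeGenerator {
  List<KeyRange> createRanges(KeyRange tabletRange, Map<String, String> config);
}

/** Hypothetical generator: one range of `width` keys every `stride` keys. */
final class StridedRangeGenerator implements RangeGenerator {
  @Override
  public List<KeyRange> createRanges(KeyRange tabletRange, Map<String, String> config) {
    long stride = Long.parseLong(config.get("stride"));
    long width = Long.parseLong(config.get("width"));
    if (stride <= 0 || width <= 0) {
      throw new IllegalArgumentException("stride and width must be positive");
    }
    List<KeyRange> ranges = new ArrayList<>();
    // Start at the last stride boundary at or before the tablet start, and
    // keep only the part of each generated range that lies inside the tablet.
    for (long k = Math.floorDiv(tabletRange.start, stride); k * stride < tabletRange.end; k++) {
      KeyRange clipped = new KeyRange(k * stride, k * stride + width).clip(tabletRange);
      if (clipped != null) {
        ranges.add(clipped);
      }
    }
    return ranges;
  }
}

final class RangeGeneratorSketch {
  /** Instantiates a generator from a configured class name via reflection. */
  static RangeGenerator load(String className) throws ReflectiveOperationException {
    return Class.forName(className)
        .asSubclass(RangeGenerator.class)
        .getDeclaredConstructor()
        .newInstance();
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    // Only the class name and two small parameters travel with the job.
    Map<String, String> config = new HashMap<>();
    config.put("stride", "50");
    config.put("width", "10");
    RangeGenerator generator = load(StridedRangeGenerator.class.getName());

    // Each input split would call the generator with its own tablet range.
    KeyRange tablet = new KeyRange(105, 160);
    System.out.println(generator.createRanges(tablet, config)); // [[105,110), [150,160)]
  }
}
```

The point of the design is that the job configuration stays small no matter how many ranges each split ends up scanning, because the ranges are derived on the task side from the tablet range and a handful of parameters.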



> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
>                 Key: ACCUMULO-3602
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>             Fix For: 1.7.0
>
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} 
> specified in the configuration. Some table indexing schemes, for instance a 
> z-order geospatial index, produce a large number of small ranges, resulting 
> in a large number of splits. This is specifically a concern when using 
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is 
> mapped to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and 
> high overhead on processing. A desirable alternative is to group ranges by 
> tablet into a single split and use {{BatchScanner}} to produce the records. 
> Grouping by tablet is useful because it reflects Accumulo's attempt to 
> distribute stored records and can be influenced by the user through table 
> splits.
> The grouping functionality already exists in the internal {{TabletLocator}} 
> class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it 
> generates either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a 
> new setting in {{InputConfigurator}}.  {{AccumuloInputFormat}} would then be 
> able to inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 
> 1.7 as it is useful for optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
