[ 
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486088#comment-14486088
 ] 

ASF GitHub Bot commented on ACCUMULO-3602:
------------------------------------------

Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/25#issuecomment-91041003
  
    > > Agreed, but is this a new issue?
    
    > I think the intent of this PR is new. I think the intent is to 
efficiently support a map reduce job that reads from many small ranges. If we 
are going to do that, then let's do it in a way that scales better. 
    
    Re-reading the issue description, yes, you are correct. However, what 
Eugene has done is already an improvement over what currently exists, 
IMO. Assuming that the majority of Ranges that his Spark workflow creates are 
non-overlapping (a hunch given the subject matter -- very small boxes of 
interest), this still reduces the total memory use of splits and Ranges. The 
worst case would be numerous very large ranges. We're never going to have the 
ideal solution the first time around -- I just don't want this to stall because 
we come up with a different solution to an already solved problem.
    
    I'm happy to continue the discussion about dealing with large numbers of 
ranges efficiently; I'm just not convinced this is the place to do it.


> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
>                 Key: ACCUMULO-3602
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>             Fix For: 1.7.0
>
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} 
> specified in the configuration. Some table indexing schemes, for instance 
> a z-order geospatial index, produce a large number of small ranges, resulting 
> in a large number of splits. This is specifically a concern when using 
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is 
> mapped to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and 
> high overhead on processing. A desirable alternative is to group ranges by 
> tablet into a single split and use {{BatchScanner}} to produce the records. 
> Grouping by tablet is useful because it reflects Accumulo's attempt to 
> distribute stored records and can be influenced by the user through table splits.
> The grouping functionality already exists in the internal {{TabletLocator}} 
> class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates 
> either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a new setting 
> in {{InputConfigurator}}.  {{AccumuloInputFormat}} would then be able to 
> inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 
> 1.7 as it is useful for optimizations.
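
To make the grouping idea in the description above concrete, here is a minimal 
sketch in plain Java. It does not use Accumulo's {{TabletLocator}} or {{Range}} 
classes: RowRange, groupByTablet, and LAST_TABLET are hypothetical names 
introduced only for illustration, and tablets are modeled simply as a sorted 
set of split points, where each tablet holds the rows up to and including its 
end row. A range that crosses a split point is listed under every tablet it 
overlaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class RangeGrouping {

    /** Hypothetical stand-in for a row range with inclusive bounds (not Accumulo's Range). */
    static final class RowRange {
        final String start;
        final String end;

        RowRange(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + "]";
        }
    }

    /** Label for the last tablet, which has no end row (assumption of this sketch). */
    static final String LAST_TABLET = "<last>";

    /**
     * Groups ranges by the tablets they overlap. A tablet is identified by its
     * end row: tablet "g" holds rows in (previous split point, "g"].
     */
    static Map<String, List<RowRange>> groupByTablet(NavigableSet<String> splitPoints,
                                                     List<RowRange> ranges) {
        Map<String, List<RowRange>> groups = new TreeMap<>();
        for (RowRange r : ranges) {
            // First tablet whose end row is >= the start of the range.
            String tablet = splitPoints.ceiling(r.start);
            while (true) {
                String key = (tablet == null) ? LAST_TABLET : tablet;
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
                // Stop once the current tablet covers the end of the range.
                if (tablet == null || tablet.compareTo(r.end) >= 0) {
                    break;
                }
                tablet = splitPoints.higher(tablet);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two split points give three tablets: (-inf, g], (g, p], (p, +inf).
        NavigableSet<String> splits = new TreeSet<>(Arrays.asList("g", "p"));
        List<RowRange> ranges = Arrays.asList(
                new RowRange("a", "b"), new RowRange("c", "d"),
                new RowRange("h", "i"), new RowRange("n", "q"),
                new RowRange("x", "z"));
        // One entry per tablet; each entry would become a single multi-range split.
        groupByTablet(splits, ranges).forEach((tablet, rs) ->
                System.out.println(tablet + " -> " + rs));
    }
}
```

In this sketch, five input ranges collapse into three groups, one per tablet, 
so a job would create three splits rather than five. Each group's ranges could 
then be handed to a single {{BatchScanner}}, as the description proposes.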



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
