[ 
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380170#comment-14380170
 ] 

Josh Elser commented on ACCUMULO-3602:
--------------------------------------

bq. Just to spell out the issue, we can't change the RangeInputSplit interface 
because it would break semver, but by the same token adding 
MultiRangeInputSplit to the same package also breaks semver as it's not 
available in 1.6.2

You got it.

bq. This feels a little weird since the InputSplits are only ever used by the 
InputFormat, as I understand it.

I hear you, but the concern for RangeInputSplit is that someone else could have 
extended it and your changes might break them. As for the addition of 
MultiRangeInputSplit, this is just strictly prohibited by semver as 1.6.3 
should only contain bug-fixes.

bq. Can you see any way that this functionality making it into 1.6.3?

Sadly, I don't think so. Perhaps it's some consolation that 1.7 should happen 
in the next month or two? We've kind of been twiddling our thumbs on the 
matter. IMO, we should start pushing it out.

bq. a universal InputSplit by combining functionality of the two and create 
appropriate reader based on split values and roll that into 1.7.

Interesting, have you given any thought to the logic that would control when 
either is used? I remember seeing some experiments where replacing the Scanner 
with a BatchScanner had very positive outcomes (obv yours is one of them), but I 
wonder if it's always true? We should try to be cautious in changing how the 
InputFormat works (even from 1.6 to 1.7).

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
>                 Key: ACCUMULO-3602
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} 
> specified in the configuration. Some table indexing schemes, for instance a 
> z-order geospatial index, produce a large number of small ranges, resulting 
> in a large number of splits. This is specifically a concern when using 
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is 
> mapped to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and 
> high overhead on processing. A desirable alternative is to group ranges by 
> tablet into a single split and use {{BatchScanner}} to produce the records. 
> Grouping by tablet is useful because it reflects Accumulo's attempt to 
> distribute stored records and can be influenced by the user through table 
> splits.
> The grouping functionality already exists in the internal {{TabletLocator}} 
> class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates 
> either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a new setting 
> in {{InputConfigurator}}.  {{AccumuloInputFormat}} would then be able to 
> inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 
> 1.7 as it is useful for optimizations.
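
For illustration, the grouping described in the issue can be sketched roughly as below. This is a minimal, self-contained sketch and not Accumulo code: RangeGroupingSketch, RowRange, tabletIndex and groupByTablet are hypothetical names, tablets are modeled only by their split points, rows are compared as Java strings rather than bytes, and both range bounds are treated as inclusive. The real binning lives in the internal TabletLocator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: none of these types are Accumulo classes.
final class RangeGroupingSketch {

    /** Stand-in for a row range; both bounds are treated as inclusive. */
    static final class RowRange {
        final String startRow;
        final String endRow;

        RowRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + endRow + "]";
        }
    }

    /**
     * Index of the tablet holding a row. Split points s1 < s2 < ... define
     * tablets (-inf, s1], (s1, s2], ..., (sN, +inf), so the index is the
     * number of split points strictly less than the row.
     */
    static int tabletIndex(NavigableSet<String> splitPoints, String row) {
        return splitPoints.headSet(row, false).size();
    }

    /** Assigns every range to each tablet it overlaps: one group per tablet. */
    static Map<Integer, List<RowRange>> groupByTablet(
            NavigableSet<String> splitPoints, List<RowRange> ranges) {
        Map<Integer, List<RowRange>> groups = new TreeMap<>();
        for (RowRange range : ranges) {
            int first = tabletIndex(splitPoints, range.startRow);
            int last = tabletIndex(splitPoints, range.endRow);
            // A range that crosses a split point is added to every tablet it touches.
            for (int tablet = first; tablet <= last; tablet++) {
                groups.computeIfAbsent(tablet, k -> new ArrayList<>()).add(range);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        NavigableSet<String> splits = new TreeSet<>(Arrays.asList("g", "p"));
        List<RowRange> ranges = Arrays.asList(
                new RowRange("a", "b"), new RowRange("c", "d"),
                new RowRange("h", "i"), new RowRange("o", "q"));
        // Each map entry would back one input split read with a BatchScanner.
        groupByTablet(splits, ranges).forEach((tablet, grouped) ->
                System.out.println("tablet " + tablet + " -> " + grouped));
    }
}
```

With the two split points in main, the four ranges collapse into three groups instead of four splits, and the range that straddles a split point shows up in both neighbouring groups. How such a range should really be clipped per tablet is left out of the sketch.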



