[ 
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377191#comment-14377191
 ] 

Josh Elser commented on ACCUMULO-3602:
--------------------------------------

Left some comments on your most recent commit in that branch. Thanks for 
sharing, [~echeipesh]! Some bigger concerns that we'll need to talk through:

* Resource management on the BatchScanner. Are we sure that it will get closed? 
I might have just missed it happening elsewhere (via ScannerBase).
* All of the MapReduce classes/interfaces/etc under 
`org.apache.accumulo.core.client` are (painfully) considered to be in the 
public API. Since we're following semver, it's unlikely that we'll be able to 
make this change in 1.6.3 as you've currently targeted. We'd have to go for 1.7.0. 
The mapred/mapreduce classes often fall into this gray area where we don't 
*really* want them all to be public API, but they got grandfathered in. We'll 
have to keep this in mind.
* The duplication between RangeInputSplit and MultiRangeInputSplit is painful 
(I think you called this out to begin with). Maybe there's some more 
consolidation that can be done with RangeInputSplit? I'm worried about it 
because we've had a couple of bugs come out of RangeInputSplit due to the 
duplicative code and insufficient testing.
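
To make the first concern concrete, here is a minimal sketch of the ownership pattern I would expect: the record reader holds the scanner it was given and releases it in its own `close()`. `ScannerLike`, `FakeScanner`, `OwningRecordReader` and `CloseDemo` are hypothetical stand-ins written for this illustration, not Accumulo or Hadoop classes, and this is not a description of what the branch currently does.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the part of a scanner that matters here:
// it can be iterated, and it must be closed to release its resources.
interface ScannerLike extends Iterable<String> {
    void close();
}

// Fake scanner that records whether close() was called (demonstration only).
class FakeScanner implements ScannerLike {
    private final List<String> rows;
    boolean closed = false;

    FakeScanner(List<String> rows) { this.rows = rows; }

    @Override public Iterator<String> iterator() { return rows.iterator(); }
    @Override public void close() { closed = true; }
}

// Hypothetical record reader that owns its scanner for its whole lifetime.
class OwningRecordReader {
    private ScannerLike scanner;
    private Iterator<String> rows;
    private String current;

    void initialize(ScannerLike scanner) {
        this.scanner = scanner;
        this.rows = scanner.iterator();
    }

    // Advances to the next record; returns false once the scanner is exhausted.
    boolean nextKeyValue() {
        if (rows != null && rows.hasNext()) {
            current = rows.next();
            return true;
        }
        return false;
    }

    String getCurrentValue() { return current; }

    // Safe to call more than once: the scanner is closed only the first time.
    void close() {
        if (scanner != null) {
            scanner.close();
            scanner = null;
            rows = null;
        }
    }
}

class CloseDemo {
    public static void main(String[] args) {
        FakeScanner scanner = new FakeScanner(Arrays.asList("row1", "row2"));
        OwningRecordReader reader = new OwningRecordReader();
        reader.initialize(scanner);
        try {
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentValue());
            }
        } finally {
            reader.close(); // runs even if reading throws
        }
        System.out.println("scanner closed: " + scanner.closed);
    }
}
```

The point of the sketch is only that exactly one object owns the scanner and that closing it does not depend on the read loop finishing normally.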

As far as testing, you can try to take a look at AccumuloInputFormatIT.java. 
That has a good example of running a "mapreduce" job over a 
MiniAccumuloCluster. You could also check out the MiniMRYarnCluster class 
provided by Hadoop/YARN itself which should hopefully give you a more realistic 
test environment (as opposed to using the local MR runner). ExamplesIT.java 
also has some examples of running MR jobs against Accumulo which you could look 
at. In short, the more you can unit test, the better. A test or two to ensure 
high-level functionality would be good IMO.

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
>                 Key: ACCUMULO-3602
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} 
> specified in the configuration. Some table indexing schemes, for instance a 
> z-order geospatial index, produce a large number of small ranges, resulting in 
> a large number of splits. This is specifically a concern when using 
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped 
> to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and 
> high overhead on processing. A desirable alternative is to group ranges by 
> tablet into a single split and use {{BatchScanner}} to produce the records. 
> Grouping by tablet is useful because it reflects Accumulo's attempt to 
> distribute stored records and can be influenced by the user through table splits.
> The grouping functionality already exists in the internal {{TabletLocator}} 
> class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates 
> either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a new setting 
> in {{InputConfigurator}}.  {{AccumuloInputFormat}} would then be able to 
> inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 
> 1.7 as it is useful for optimizations.
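
The grouping idea in the description can be sketched roughly as follows. This is a simplified model under stated assumptions, not the {{TabletLocator}} implementation: tablets are represented only by their sorted end rows (split points), each range is assumed small enough to fall inside one tablet and is assigned by its start row, and `RangeGrouping`, `RowRange` and the `"<last>"` key are hypothetical names introduced for the illustration. A real implementation would also have to handle ranges that span tablet boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class RangeGrouping {
    // Hypothetical, simplified row range [start, end]; not Accumulo's Range class.
    static final class RowRange {
        final String start;
        final String end;

        RowRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    // Groups ranges by the tablet that contains each range's start row.
    // A tablet is identified by its end row; the final tablet, which has
    // no end row, is keyed by the placeholder "<last>".
    static Map<String, List<RowRange>> groupByTablet(TreeSet<String> splitPoints,
                                                     List<RowRange> ranges) {
        Map<String, List<RowRange>> groups = new TreeMap<>();
        for (RowRange range : ranges) {
            // A tablet's end row is inclusive, so the owning tablet is the one
            // whose end row is the smallest split point >= the start row.
            String endRow = splitPoints.ceiling(range.start);
            String tablet = (endRow == null) ? "<last>" : endRow;
            groups.computeIfAbsent(tablet, k -> new ArrayList<>()).add(range);
        }
        return groups;
    }

    public static void main(String[] args) {
        TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "p"));
        List<RowRange> ranges = Arrays.asList(
            new RowRange("a", "b"), new RowRange("c", "d"),
            new RowRange("h", "i"), new RowRange("x", "z"));
        // Four small ranges collapse into three groups, one per tablet touched.
        for (Map.Entry<String, List<RowRange>> e : groupByTablet(splits, ranges).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " range(s)");
        }
    }
}
```

Each group would then back one split, read with a single {{BatchScanner}} over all of the group's ranges, instead of one split per range.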



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
