[
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486005#comment-14486005
]
ASF GitHub Bot commented on ACCUMULO-3602:
------------------------------------------
Github user keith-turner commented on the pull request:
https://github.com/apache/accumulo/pull/25#issuecomment-91033820
I was discussing the big picture behind this PR with @ctubbsii. It seems
like this change could encourage users to pass many ranges as configuration for
the MapReduce job, which could exhaust memory on the job tracker.
We discussed passing a function that generates a set of ranges instead of
passing lots of ranges. The implementation would still use a batch scanner (or a
scanner with a special iterator, but it's harder to pass code to the tserver). Each
input split could call a function like the following, which deterministically
creates a set of ranges. Those ranges could then be used for the batch
scanner.
```java
interface RangeGenerator {
  /**
   * @param tabletRange the data range of the tablet over which the input split is executing
   * @param config a mysterious class that allows the user to pass parameters to the function
   * @return the ranges to scan within that tablet
   */
  List<Range> createRanges(Range tabletRange, Myst config);
}
```
When configuring the AccumuloInputFormat to use the batch scanner, the name of a
class that implements this function would be provided. The ranges set on the
job would be large ranges covering the portions of the table to process. An input
split would be created for each tablet that falls within those large ranges, and
for each input split the function would be called to possibly create many more
ranges.
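To make the idea concrete, here is a rough, self-contained sketch of what one such generator might look like. Everything in it is an assumption for illustration: `RowRange` is a simplified stand-in for Accumulo's `Range` (so the sketch compiles without Accumulo on the classpath), `GeneratorConfig` is a hypothetical placeholder for the "Myst" parameter, and `StridedCellGenerator` is an invented example, not anything in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Accumulo's Range: a half-open [start, end) row interval.
final class RowRange {
  final String start;
  final String end;

  RowRange(String start, String end) {
    this.start = start;
    this.end = end;
  }

  // Returns the overlap of this range with other, or null if they do not intersect.
  RowRange clip(RowRange other) {
    String s = start.compareTo(other.start) >= 0 ? start : other.start;
    String e = end.compareTo(other.end) <= 0 ? end : other.end;
    return s.compareTo(e) < 0 ? new RowRange(s, e) : null;
  }
}

// Hypothetical stand-in for the "Myst" config: a few small parameters, not a list of ranges.
final class GeneratorConfig {
  final int minCell;
  final int maxCell;
  final int stride;

  GeneratorConfig(int minCell, int maxCell, int stride) {
    if (stride <= 0) {
      throw new IllegalArgumentException("stride must be positive");
    }
    this.minCell = minCell;
    this.maxCell = maxCell;
    this.stride = stride;
  }
}

// Same shape as the interface above, but expressed with the stand-in types.
interface RangeGenerator {
  List<RowRange> createRanges(RowRange tabletRange, GeneratorConfig config);
}

// Emits one single-cell range per stride and keeps only the part inside the tablet.
final class StridedCellGenerator implements RangeGenerator {
  @Override
  public List<RowRange> createRanges(RowRange tabletRange, GeneratorConfig config) {
    List<RowRange> ranges = new ArrayList<>();
    for (int cell = config.minCell; cell < config.maxCell; cell += config.stride) {
      // Zero-padded rows so that lexicographic order matches numeric order.
      RowRange candidate =
          new RowRange(String.format("%06d", cell), String.format("%06d", cell + 1));
      RowRange clipped = candidate.clip(tabletRange);
      if (clipped != null) {
        ranges.add(clipped);
      }
    }
    return ranges;
  }
}
```

With a sketch like this, the job configuration would carry only the three integers plus the generator class name. Each input split would rebuild the same ranges for its own tablet, since the output depends only on the tablet range and the config.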
> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
> Key: ACCUMULO-3602
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
> Project: Accumulo
> Issue Type: Improvement
> Components: client
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Eugene Cheipesh
> Assignee: Eugene Cheipesh
> Labels: performance
> Fix For: 1.7.0
>
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}}
> specified in the configuration. Some table indexing schemes, for instance a
> z-order geospatial index, produce a large number of small ranges, resulting in
> a large number of splits. This is specifically a concern when using
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped
> to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high
> overhead on processing. A desirable alternative is to group ranges by tablet
> into a single split and use {{BatchScanner}} to produce the records. Grouping
> by tablet is useful because it reflects Accumulo's attempt to distribute
> stored records and can be influenced by the user through table splits.
> The grouping functionality already exists in the internal {{TabletLocator}}
> class.
> The current proposal is to modify {{AbstractInputFormat}} such that it generates
> either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a new setting
> in {{InputConfigurator}}. {{AccumuloInputFormat}} would then be able to
> inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in
> 1.7 as it is useful for optimizations.