Github user keith-turner commented on the pull request:
https://github.com/apache/accumulo/pull/25#issuecomment-91033820
I was discussing the big picture behind this PR with @ctubbsii. It seems
like this change could encourage users to pass many ranges as configuration for
the MapReduce job, which could exhaust memory on the job tracker.
We discussed passing a function that generates a set of ranges instead of
passing lots of ranges. The implementation would still use a batch scanner (or a
scanner with a special iterator, but it's harder to pass code to the tserver). Each
input split could call a function like the following, which deterministically
creates a set of ranges. Those ranges could then be used for the batch
scanner.
```java
interface RangeGenerator {
  /**
   * @param tabletRange The data range for the tablet over which the input
   *        split is executing
   * @param config a mysterious class that allows the user to pass parameters
   *        to the function
   */
  List<Range> createRanges(Range tabletRange, Myst config);
}
```
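To make the idea concrete, here is a rough sketch of what one such generator might look like. Everything in it beyond the interface shape is an assumption for illustration: `RowRange` is a simplified stand-in for Accumulo's `Range` (half-open row intervals with no unbounded ends), a `Map<String, String>` stands in for the unspecified config class, and the row layout `<shard>_<day>` and the `ShardDayRangeGenerator` name are made up. The point is that two small parameters can expand into many ranges inside each split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified stand-in for Accumulo's Range: the half-open row interval
// [startRow, endRow). Hypothetical, for illustration only.
final class RowRange {
  final String startRow; // inclusive
  final String endRow;   // exclusive

  RowRange(String startRow, String endRow) {
    this.startRow = startRow;
    this.endRow = endRow;
  }

  @Override
  public String toString() {
    return "[" + startRow + "," + endRow + ")";
  }
}

// Same shape as the proposed interface; a Map stands in for the config class.
interface RangeGenerator {
  List<RowRange> createRanges(RowRange tabletRange, Map<String, String> config);
}

// Hypothetical generator for rows laid out as "<3-digit shard>_<day>".
// It emits one range per shard for the configured day, clipped to the tablet.
final class ShardDayRangeGenerator implements RangeGenerator {
  @Override
  public List<RowRange> createRanges(RowRange tablet, Map<String, String> config) {
    int numShards = Integer.parseInt(config.getOrDefault("numShards", "0"));
    String day = config.getOrDefault("day", "");
    List<RowRange> ranges = new ArrayList<>();
    for (int shard = 0; shard < numShards; shard++) {
      String prefix = String.format(Locale.ROOT, "%03d_%s", shard, day);
      String prefixEnd = endOfPrefix(prefix);
      // Intersect [prefix, prefixEnd) with the tablet's range.
      String start = prefix.compareTo(tablet.startRow) > 0 ? prefix : tablet.startRow;
      String end = prefixEnd.compareTo(tablet.endRow) < 0 ? prefixEnd : tablet.endRow;
      if (start.compareTo(end) < 0) {
        ranges.add(new RowRange(start, end));
      }
    }
    return ranges;
  }

  // Smallest string greater than every string starting with prefix:
  // bump the last character by one.
  private static String endOfPrefix(String prefix) {
    int last = prefix.length() - 1;
    return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
  }
}
```

Because the output depends only on the tablet range and the config, every attempt of the same split would produce the same ranges, and the job configuration carries two short strings rather than one entry per range.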
When configuring the AccumuloInputFormat to use the batch scanner, the name of
a class that implements this interface would be provided. The ranges set on the
job would be large ranges covering the portions of the table to process. An input
split would be created for each tablet that falls within those large ranges, and
for each input split the function would be called to possibly create many more
ranges.
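As a sketch of the "provide a class name" step, each input split could instantiate the generator by name with plain Java reflection. This is only an illustration: how AccumuloInputFormat would actually store and load the class name is not decided here, and `SplitRangeGenerator`, `IdentityRangeGenerator`, and `RangeGeneratorLoader` are hypothetical names, with a `String[] {startRow, endRow}` pair standing in for `Range`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Stand-in for the proposed interface; a String[] {startRow, endRow}
// pair stands in for Accumulo's Range. Hypothetical, for illustration.
interface SplitRangeGenerator {
  List<String[]> createRanges(String[] tabletRange, Map<String, String> config);
}

// Trivial implementation: scan the whole tablet range unchanged.
class IdentityRangeGenerator implements SplitRangeGenerator {
  @Override
  public List<String[]> createRanges(String[] tabletRange, Map<String, String> config) {
    return Collections.singletonList(tabletRange);
  }
}

final class RangeGeneratorLoader {
  // Instantiate the configured generator through its no-arg constructor.
  // Fails if the class is missing or does not implement the interface.
  static SplitRangeGenerator load(String className) throws ReflectiveOperationException {
    return Class.forName(className)
        .asSubclass(SplitRangeGenerator.class)
        .getDeclaredConstructor()
        .newInstance();
  }
}
```

With something like this, only the class name and a few parameters would travel in the job configuration; the potentially large list of ranges would exist only inside the task that runs each split.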