[
https://issues.apache.org/jira/browse/ACCUMULO-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389572#comment-14389572
]
ASF GitHub Bot commented on ACCUMULO-3602:
------------------------------------------
GitHub user echeipesh opened a pull request:
https://github.com/apache/accumulo/pull/25
ACCUMULO-3602 BatchScanner optimization for AccumuloInputFormat
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/echeipesh/accumulo ACCUMULO-3602
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/accumulo/pull/25.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #25
----
commit 8da593302e2c702d8a775752d5f5620d2ddb3345
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-23T20:43:46Z
Added support for grouping ranges per tablet when using AccumuloInputFormat
commit 47a2fae8423ab383364228469055904803751c1a
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-25T15:08:20Z
Avoid casts by using AccumuloInputSplit interface
commit b4101a37949271a1b388ac5bd3f1b1d14e00d272
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-25T15:10:05Z
upgrade warning to exception if batch scan is requested but not available
commit ad83769d96224c55f909ae4ba0d6482c7555bf1c
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-25T15:11:24Z
close underlying scanner with the RecordReader
commit 46a45c3582289730905ee4a68fa3fcf8f5245ccd
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T19:08:37Z
Upgrade AccumuloInputSplit to abstract class for code reuse
commit 7a09974b4ae731c78e7a99ec94d02c3a7a5ad0b1
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T19:53:58Z
merging upstream master
commit 41e79cf9b71b48be9acf2a44a44a523476c45024
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T20:00:42Z
fix merge errors
commit 7b3beebb2ba023e8862f45c89b61435a7b404368
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T20:32:06Z
add batch scan setters/getters to AccumuloInputFormat
commit c84c8c035414d6f32a864b573f7f6a1f6c16551a
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T21:47:17Z
fix casting error
commit 35495142d007303cbba3c083a07abbe1a859e289
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T21:47:35Z
make findbugs happy
commit 7ac1cef68d651b9c19bfc57db28a718630ae37d1
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T21:47:48Z
test BatchInputSplit generation
commit 9a477220a205816aa8641d2b362c3b2c395c83f0
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T21:58:26Z
match code flow between mapred and mapreduce
also: killing dead code for “iterators”
commit 23f6d400b4794ea8c30570f51fcdae09ba33a17c
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T22:21:59Z
test all splits for correct type, not just head
commit 17fa276c8dc06c6036bc7a4c9ccc2a8cf23f90e9
Author: Eugene Cheipesh <[email protected]>
Date: 2015-03-31T22:55:32Z
Test MR job with BatchScan option enable
----
> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>
> Key: ACCUMULO-3602
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3602
> Project: Accumulo
> Issue Type: Improvement
> Components: client
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Eugene Cheipesh
> Assignee: Eugene Cheipesh
> Labels: performance
>
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}}
> specified in the configuration. Some table indexing schemes, for instance
> a z-order geospatial index, produce a large number of small ranges, resulting
> in a large number of splits. This is specifically a concern when using
> {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped
> to an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and
> high overhead on processing. A desirable alternative is to group ranges by
> tablet into a single split and use {{BatchScanner}} to produce the records.
> Grouping by tablets is useful because it reflects Accumulo's attempt to
> distribute stored records and can be influenced by the user through table
> splits.
> The grouping functionality already exists in the internal {{TabletLocator}}
> class.
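
To make the grouping idea concrete, here is a minimal, self-contained sketch. It does not use {{TabletLocator}} or any other Accumulo class: RowRange, RangeGrouper, and the "<last>" label are hypothetical names invented for illustration, and tablets are modelled only by their sorted end rows (split points), assuming a row belongs to the tablet with the smallest end row greater than or equal to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a row range with inclusive bounds; not Accumulo's Range.
final class RowRange {
    final String startRow;
    final String endRow;

    RowRange(String startRow, String endRow) {
        this.startRow = startRow;
        this.endRow = endRow;
    }
}

// Hypothetical helper; the real grouping lives in Accumulo's internal TabletLocator.
final class RangeGrouper {
    // Label for the final tablet, which has no end row.
    // Assumed not to collide with a real split point.
    static final String LAST_TABLET = "<last>";

    // Groups ranges by the tablets they overlap. A tablet is identified by its
    // end row; a range spanning several tablets is added to each of them.
    static Map<String, List<RowRange>> groupByTablet(TreeSet<String> splitPoints,
                                                     List<RowRange> ranges) {
        Map<String, List<RowRange>> groups = new TreeMap<>();
        for (RowRange range : ranges) {
            boolean covered = false;
            // Walk tablet end rows starting at the tablet that holds startRow.
            for (String tabletEnd : splitPoints.tailSet(range.startRow, true)) {
                groups.computeIfAbsent(tabletEnd, k -> new ArrayList<>()).add(range);
                if (tabletEnd.compareTo(range.endRow) >= 0) {
                    covered = true; // this tablet contains the end of the range
                    break;
                }
            }
            if (!covered) {
                // The range extends past the last split point into the final tablet.
                groups.computeIfAbsent(LAST_TABLET, k -> new ArrayList<>()).add(range);
            }
        }
        return groups;
    }
}
```

With split points "g" and "n", five small ranges collapse into at most three groups, one per tablet, instead of five separate splits. A range that crosses a tablet boundary appears in more than one group, so a real implementation would presumably also need to clip it to each tablet's extent.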
> The current proposal is to modify {{AbstractInputFormat}} so that it generates
> either {{RangeInputSplit}} or {{MultiRangeInputSplit}} based on a new setting
> in {{InputConfigurator}}. {{AccumuloInputFormat}} would then be able to
> inspect the type of the split and instantiate an appropriate reader.
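
The proposed flow can be sketched as follows. This is only an illustration under stated assumptions: SketchSplit, SingleRangeSketchSplit, MultiRangeSketchSplit, and SplitPlanner are hypothetical names, ranges are plain strings, and the boolean flag stands in for the new {{InputConfigurator}} setting. None of it is the actual Accumulo code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical base type standing in for the input split hierarchy.
abstract class SketchSplit { }

// One split per range: the existing behaviour described in the issue.
final class SingleRangeSketchSplit extends SketchSplit {
    final String range;

    SingleRangeSketchSplit(String range) {
        this.range = range;
    }
}

// One split per tablet, carrying every range that falls in that tablet.
final class MultiRangeSketchSplit extends SketchSplit {
    final String tablet;
    final List<String> ranges;

    MultiRangeSketchSplit(String tablet, List<String> ranges) {
        this.tablet = tablet;
        this.ranges = new ArrayList<>(ranges);
    }
}

final class SplitPlanner {
    // batchScan is a stand-in for the proposed configuration setting.
    static List<SketchSplit> plan(Map<String, List<String>> rangesByTablet,
                                  boolean batchScan) {
        List<SketchSplit> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : rangesByTablet.entrySet()) {
            if (batchScan) {
                splits.add(new MultiRangeSketchSplit(entry.getKey(), entry.getValue()));
            } else {
                for (String range : entry.getValue()) {
                    splits.add(new SingleRangeSketchSplit(range));
                }
            }
        }
        return splits;
    }

    // Inspects the split type to decide which kind of scanner a reader would use.
    static String readerKindFor(SketchSplit split) {
        if (split instanceof MultiRangeSketchSplit) {
            return "BatchScanner";
        }
        if (split instanceof SingleRangeSketchSplit) {
            return "Scanner";
        }
        throw new IllegalArgumentException("Unknown split type: " + split.getClass());
    }
}
```

Keeping the decision in the split type means the reader side needs no extra configuration lookup: whatever produced the split has already fixed how it should be read.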
> The functionality of {{TabletLocator}} should be exposed as a public API in
> 1.7 as it is useful for optimizations.