GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/10726
Spark 12469 consistent accumulators for spark
draft PR for review
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark
SPARK-12469-consistent-accumulators-for-spark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10726.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10726
----
commit 7c9f874d94085f60089cda8ab3b6c451bf793a1a
Author: Holden Karau <[email protected]>
Date: 2015-12-28T19:59:30Z
Start adding the proposed external API for consistent accumulators
commit fce92981371ddd7af8ad28f30537fb473bbee378
Author: Holden Karau <[email protected]>
Date: 2015-12-28T19:59:47Z
Start keeping track of the required information in the taskcontext
commit c0e2bfd9d3d5f22e9ea217a500b67e84679f519d
Author: Holden Karau <[email protected]>
Date: 2015-12-30T00:29:53Z
A bit more progress towards consistent accumulators, add an add function
which checks the bitset, throw an exception on merges (perhaps this isn't
reasonable and then we need to keep all of the previous values - but a quick
skim looks like its intended to be a user accesiable function so just don't
support).
commit e9c287f682cbefc531a8fc84b00a472f860c9d06
Author: Holden Karau <[email protected]>
Date: 2015-12-30T01:51:53Z
Waaait that was not thinking clearly. As we go with a consistent
accumulator adding values add them to a hashmap of pending values to be merged
based on the rdd id & partition id - then driver side when merging the
accumulators check each rdd & partition id individually against the rdd &
bitset of processed partitions for that rdd adding to a single accumulator.
(This way we don't need to keep all of the values around forever but if we say
have first() and then foreach() we can add the value for partition one then
when the second task happens and we can skip just the adding partion one. Up
next: start adding some simple tests for this to see if it works
commit cf32dc5d543d4c0bb883e00692d21f22a7470737
Author: Holden Karau <[email protected]>
Date: 2016-01-06T07:45:18Z
Merge branch 'master' into SPARK-12469-consistent-accumulators-for-spark
commit d9354b12977bdaa2bcd82730fb34bdae490ab02d
Author: Holden Karau <[email protected]>
Date: 2016-01-06T20:26:13Z
add consistentValue (temporary) to get the consistent value and fix the
accumulator to work even when no merge happens
commit 28a5754b654441ad8904f6a8f6e4a06db3908541
Author: Holden Karau <[email protected]>
Date: 2016-01-06T20:58:07Z
Introduce a GenericAccumulable to allow for Accumulables which don't return
the same value as the value accumulate (e.g. consistent accumulators accumulate
a lot of book keeping information which the user shouldn't know about so hide
that). Accumulable now extends GenericAccumulable to keep the API the same
commit 21387e1c5ebf10d6a2b26996ba0a54640925a903
Author: Holden Karau <[email protected]>
Date: 2016-01-07T07:37:31Z
Temporary debugging
commit 11ac8364c4077ee14c2284f6d3a17341efb79b61
Author: Holden Karau <[email protected]>
Date: 2016-01-07T07:38:36Z
Add a some tests for consistent accumulators
commit b76218d1645b58be587e9db1337b2867b20dcaf1
Author: Holden Karau <[email protected]>
Date: 2016-01-07T09:11:38Z
Switch to doing explicit passin in the API for now (can figure out the
magic later if we want magic). Next up: handle partially consumed partitions.
commit 7fb981c9f1503747b687f31d37a11b22c87fcf66
Author: Holden Karau <[email protected]>
Date: 2016-01-07T19:22:39Z
Get rid of some old junk added when trying the old approach
commit c75b037cdffd05c415e8e3dbe247f921009e31d6
Author: Holden Karau <[email protected]>
Date: 2016-01-07T21:39:01Z
Have tasks report if they've processed the entire partition and skip
incrementing the consistent accumulator if this is not the case
commit df89265241891380810860ad0f26e653efed298d
Author: Holden Karau <[email protected]>
Date: 2016-01-07T22:54:56Z
Regardless of how much of the iterator we read in the task, if we are using
persistance than the entire partition gets computed
commit 33b45411c4f7ac45c439e020d82db95f8333e0bd
Author: Holden Karau <[email protected]>
Date: 2016-01-07T23:20:34Z
Some improvements
commit b2d0926eac596b0f075db7809df1130bd7025d00
Author: Holden Karau <[email protected]>
Date: 2016-01-11T23:07:13Z
Merge in master
commit 4fd97a0a85f4e2c3a0e9b7edf2deb0a957443c4d
Author: Holden Karau <[email protected]>
Date: 2016-01-12T18:53:01Z
Style fixes - todo check some indentation and such
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]