[GitHub] spark pull request: Spark 12469 consistent accumulators for spark

holdenk Tue, 12 Jan 2016 10:57:17 -0800

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/10726


    Spark 12469 consistent accumulators for spark

    draft PR for review

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-12469-consistent-accumulators-for-spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10726.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10726
    
----
commit 7c9f874d94085f60089cda8ab3b6c451bf793a1a
Author: Holden Karau <[email protected]>
Date:   2015-12-28T19:59:30Z

    Start adding the proposed external API for consistent accumulators

commit fce92981371ddd7af8ad28f30537fb473bbee378
Author: Holden Karau <[email protected]>
Date:   2015-12-28T19:59:47Z

    Start keeping track of the required information in the taskcontext

commit c0e2bfd9d3d5f22e9ea217a500b67e84679f519d
Author: Holden Karau <[email protected]>
Date:   2015-12-30T00:29:53Z

    A bit more progress towards consistent accumulators: add an add function
which checks the bitset, and throw an exception on merges (perhaps this isn't
reasonable and we then need to keep all of the previous values - but from a
quick skim merge looks like it's intended to be a user-accessible function, so
just don't support it).

commit e9c287f682cbefc531a8fc84b00a472f860c9d06
Author: Holden Karau <[email protected]>
Date:   2015-12-30T01:51:53Z

    Wait, that was not thinking clearly. As values are added to a consistent
accumulator, add them to a hashmap of pending values to be merged, keyed by
the rdd id & partition id. Then, driver side, when merging the accumulators,
check each rdd & partition id individually against the bitset of processed
partitions for that rdd, adding to a single accumulator. (This way we don't
need to keep all of the values around forever, but if we have, say, first()
and then foreach(), we can add the value for partition one, and then when the
second task happens we can skip adding just partition one.) Up next: start
adding some simple tests for this to see if it works.

commit cf32dc5d543d4c0bb883e00692d21f22a7470737
Author: Holden Karau <[email protected]>
Date:   2016-01-06T07:45:18Z

    Merge branch 'master' into SPARK-12469-consistent-accumulators-for-spark

commit d9354b12977bdaa2bcd82730fb34bdae490ab02d
Author: Holden Karau <[email protected]>
Date:   2016-01-06T20:26:13Z

    add consistentValue (temporary) to get the consistent value and fix the 
accumulator to work even when no merge happens

commit 28a5754b654441ad8904f6a8f6e4a06db3908541
Author: Holden Karau <[email protected]>
Date:   2016-01-06T20:58:07Z

    Introduce a GenericAccumulable to allow for Accumulables which don't return
the same value as the value they accumulate (e.g. consistent accumulators
accumulate a lot of bookkeeping information which the user shouldn't know
about, so hide that). Accumulable now extends GenericAccumulable to keep the
API the same.

commit 21387e1c5ebf10d6a2b26996ba0a54640925a903
Author: Holden Karau <[email protected]>
Date:   2016-01-07T07:37:31Z

    Temporary debugging

commit 11ac8364c4077ee14c2284f6d3a17341efb79b61
Author: Holden Karau <[email protected]>
Date:   2016-01-07T07:38:36Z

    Add some tests for consistent accumulators

commit b76218d1645b58be587e9db1337b2867b20dcaf1
Author: Holden Karau <[email protected]>
Date:   2016-01-07T09:11:38Z

    Switch to doing explicit passing in the API for now (can figure out the
magic later if we want magic). Next up: handle partially consumed partitions.

commit 7fb981c9f1503747b687f31d37a11b22c87fcf66
Author: Holden Karau <[email protected]>
Date:   2016-01-07T19:22:39Z

    Get rid of some old junk added when trying the old approach

commit c75b037cdffd05c415e8e3dbe247f921009e31d6
Author: Holden Karau <[email protected]>
Date:   2016-01-07T21:39:01Z

    Have tasks report if they've processed the entire partition and skip 
incrementing the consistent accumulator if this is not the case

commit df89265241891380810860ad0f26e653efed298d
Author: Holden Karau <[email protected]>
Date:   2016-01-07T22:54:56Z

    Regardless of how much of the iterator we read in the task, if we are using
persistence then the entire partition gets computed

commit 33b45411c4f7ac45c439e020d82db95f8333e0bd
Author: Holden Karau <[email protected]>
Date:   2016-01-07T23:20:34Z

    Some improvements

commit b2d0926eac596b0f075db7809df1130bd7025d00
Author: Holden Karau <[email protected]>
Date:   2016-01-11T23:07:13Z

    Merge in master

commit 4fd97a0a85f4e2c3a0e9b7edf2deb0a957443c4d
Author: Holden Karau <[email protected]>
Date:   2016-01-12T18:53:01Z

    Style fixes - todo check some indentation and such

----
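
The commit messages above describe the bookkeeping only in prose, so here is a
rough sketch of how the pieces might fit together. It is a plain-Python model,
not code from this pull request and not Spark's API: the class and method names
(PendingUpdates, ConsistentAccumulatorModel, new_task_updates, merge) are
hypothetical, a Python set stands in for the per-RDD bitset of processed
partitions, and the fully_processed flag is an assumed stand-in for "tasks
report if they've processed the entire partition". Only the consistent value is
exposed; the pending map and processed-partition sets stay internal, in the
spirit of the GenericAccumulable commit.

```python
from collections import defaultdict
from typing import Callable, Dict, Generic, Set, Tuple, TypeVar

T = TypeVar("T")


class PendingUpdates(Generic[T]):
    """Task-side state: pending values keyed by (rdd_id, partition_id)."""

    def __init__(self, zero: T, add: Callable[[T, T], T]) -> None:
        self._zero = zero
        self._add = add
        self.pending: Dict[Tuple[int, int], T] = {}

    def add(self, rdd_id: int, partition_id: int, value: T) -> None:
        # Fold the value into the pending entry for this RDD partition.
        key = (rdd_id, partition_id)
        self.pending[key] = self._add(self.pending.get(key, self._zero), value)


class ConsistentAccumulatorModel(Generic[T]):
    """Driver-side state: counts each (rdd_id, partition_id) at most once."""

    def __init__(self, zero: T, add: Callable[[T, T], T]) -> None:
        self._zero = zero
        self._add = add
        self._value = zero
        # Hypothetical stand-in for the per-RDD bitset of processed partitions.
        self._processed: Dict[int, Set[int]] = defaultdict(set)

    def new_task_updates(self) -> PendingUpdates[T]:
        return PendingUpdates(self._zero, self._add)

    def merge(self, updates: PendingUpdates[T], fully_processed: bool) -> None:
        # Assumption: a task that did not consume its whole partition
        # contributes nothing, so a later full pass can count it correctly.
        if not fully_processed:
            return
        for (rdd_id, partition_id), value in updates.pending.items():
            seen = self._processed[rdd_id]
            if partition_id in seen:
                continue  # already counted, e.g. a re-run or a second action
            seen.add(partition_id)
            self._value = self._add(self._value, value)

    @property
    def consistent_value(self) -> T:
        # Only the accumulated value is visible, not the bookkeeping.
        return self._value


acc = ConsistentAccumulatorModel(0, lambda a, b: a + b)
partitions = [[1, 2, 3], [4, 5]]  # toy data for a single RDD with id 1

# A first()-style task that read only one element of partition 0: skipped.
partial = acc.new_task_updates()
partial.add(1, 0, partitions[0][0])
acc.merge(partial, fully_processed=False)

# Two full passes over the same RDD (think foreach() run twice).
for _ in range(2):
    for partition_id, rows in enumerate(partitions):
        updates = acc.new_task_updates()
        for row in rows:
            updates.add(1, partition_id, row)
        acc.merge(updates, fully_processed=True)

print(acc.consistent_value)
```

Keying by (rdd id, partition id) is what lets the driver drop the pending
values once they are merged: it only has to remember which partitions were
already counted, not the values themselves.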


