[GitHub] spark pull request: [BRANCH-1.2][SPARK-4583][MLLIB] LogLoss for Gr...

mengxr Tue, 25 Nov 2014 23:40:28 -0800

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/3474


    [BRANCH-1.2][SPARK-4583][MLLIB] LogLoss for GradientBoostedTrees fix + doc 
updates

    We reverted #3439 in branch-1.2 due to missing `import 
o.a.s.SparkContext._`, which is no longer needed in master (#3262). This PR 
adds #3439 back to branch-1.2 with correct imports.
    
    GitHub is out-of-sync now. The real changes are the last two commits.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-4583-1.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3474
    
----
commit a9944c809017cc61c9c2e38efe9d709dfb0a94cd
Author: Tathagata Das <[email protected]>
Date:   2014-11-25T22:16:27Z

    [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in 
PairDStreamFunctions.saveAsNewAPIHadoopFiles
    
    Solves two JIRAs in one shot
    - Makes the ForEachDStream created by saveAsNewAPIHadoopFiles serializable 
for checkpoints
    - Makes the default configuration object used by saveAsNewAPIHadoopFiles 
Spark's Hadoop configuration
    
    Author: Tathagata Das <[email protected]>
    
    Closes #3457 from tdas/savefiles-fix and squashes the following commits:
    
    bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles
    b382ea9 [Tathagata Das] Fix serialization issue in 
PairDStreamFunctions.saveAsNewAPIHadoopFiles.
    
    (cherry picked from commit 8838ad7c135a585cde015dc38b5cb23314502dd9)
    Signed-off-by: Tathagata Das <[email protected]>

commit a2c01ae5e3489b6c21a4c7bcc1ec615069ff4829
Author: Tathagata Das <[email protected]>
Date:   2014-11-25T23:27:20Z

    [HOTFIX] Fixing broken build due to missing imports.

commit ee0317509ee1dfd9c5807890412f9ac5ebf16eb3
Author: Andrew Or <[email protected]>
Date:   2014-11-25T23:46:26Z

    [SPARK-4592] Avoid duplicate worker registrations in standalone mode
    
    **Summary.** On failover, the Master may receive duplicate registrations 
from the same worker, causing the worker to exit. This is caused by this commit 
https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f,
 which adds logic for the worker to re-register with the master in case of 
failures. However, the following race condition may occur:
    
    (1) Master A fails and Worker attempts to reconnect to all masters
    (2) Master B takes over and notifies Worker
    (3) Worker responds by registering with Master B
    (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, 
causing the same Worker to register with Master B twice
    
    **Fix.** Instead of attempting to register with all known masters, the 
worker should re-register with only the one that it has been communicating 
with. This is safe because the fact that a failover has occurred means the old 
master must have died. Then, when the worker is finally notified of a new 
master, it gives up on the old one in favor of the new one.
    
    **Caveat.** Even this fix is subject to more obscure race conditions. For 
instance, if Master B fails and Master A recovers immediately, then Master A 
may still observe duplicate worker registrations. However, this and other 
potential race conditions summarized in 
[SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much 
less likely than the one described above, which is deterministically 
reproducible.
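
The race and the fix described above can be modelled in a few lines. This is only a toy sketch: `Master`, `failover`, and the `register_with_all` flag are hypothetical names made up for illustration, not Spark's actual classes or messages.

```python
class Master:
    """Hypothetical stand-in for a standalone Master; not Spark's class."""

    def __init__(self, name):
        self.name = name
        self.workers = set()
        self.duplicates = 0

    def register(self, worker_id):
        # A second registration from the same worker counts as a duplicate.
        if worker_id in self.workers:
            self.duplicates += 1
            return False
        self.workers.add(worker_id)
        return True


def failover(register_with_all):
    """Replay steps (1)-(4) and return how many duplicates Master B saw."""
    master_a, master_b = Master("A"), Master("B")
    active = master_a

    # (1) Master A fails. The worker sends reconnection attempts either to
    #     every known master (old behaviour) or only to the one it was
    #     talking to (the fix).
    targets = [master_a, master_b] if register_with_all else [active]
    # Messages addressed to the dead master are lost; the rest stay in flight.
    in_flight = [m for m in targets if m is not master_a]

    # (2) + (3) Master B takes over, notifies the worker, and the worker
    #     registers with it.
    active = master_b
    active.register("worker-1")

    # (4) The delayed reconnection attempts finally arrive.
    for master in in_flight:
        master.register("worker-1")

    return master_b.duplicates
```

Calling `failover(True)` replays the old behaviour and `failover(False)` the fixed one; comparing the two return values shows whether Master B observed a duplicate registration in this simplified model.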
    
    Author: Andrew Or <[email protected]>
    
    Closes #3447 from andrewor14/standalone-failover and squashes the following 
commits:
    
    0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
    79286dc [Andrew Or] Preserve old behavior for initial retries
    83b321c [Andrew Or] Tweak wording
    1fce6a9 [Andrew Or] Active master actor could be null in the beginning
    b6f269e [Andrew Or] Avoid duplicate worker registrations
    
    (cherry picked from commit 1b2ab1cd1b7cab9076f3c511188a610eda935701)
    Signed-off-by: Andrew Or <[email protected]>

commit 58c840dde8776efefd5e180d95379598fd061172
Author: Andrew Or <[email protected]>
Date:   2014-11-25T23:48:02Z

    [SPARK-4546] Improve HistoryServer first time user experience
    
    The documentation points the user to run the following
    ```
    sbin/start-history-server.sh
    ```
    The first thing this does is throw an exception that complains a log 
directory is not specified. The exception message itself does not say anything 
about what to set. Instead we should have a default and a landing page with a 
better message. The new default log directory is `file:/tmp/spark-events`.
    
    This is what it looks like as of this PR:
    
    
![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)
    
    Author: Andrew Or <[email protected]>
    
    Closes #3411 from andrewor14/minor-history-improvements and squashes the 
following commits:
    
    f33d6b3 [Andrew Or] Point user to set config if default log dir does not 
exist
    fc4c17a [Andrew Or] Improve HistoryServer UX
    
    (cherry picked from commit 9afcbe494a3535a9bf7958429b72e989972f82d9)
    Signed-off-by: Andrew Or <[email protected]>

commit 93b914df1566c6359d8f1546ab7344823dc4341f
Author: hushan[胡珊] <[email protected]>
Date:   2014-11-25T23:51:08Z

    Fix SPARK-4471: blockManagerIdFromJson function throws exception while B...
    
    Fix [SPARK-4471](https://issues.apache.org/jira/browse/SPARK-4471): 
blockManagerIdFromJson function throws exception when BlockManagerId is null 
in MetadataFetchFailedException
    
    Author: hushan[胡珊] <[email protected]>
    
    Closes #3340 from suyanNone/fix-blockmanagerId-jnothing-2 and squashes the 
following commits:
    
    159f9a3 [hushan[胡珊]] Refine test code for blockmanager is null
    4380d73 [hushan[胡珊]] remove useless blank line
    3ccf651 [hushan[胡珊]] Fix SPARK-4471: blockManagerIdFromJson function 
throws exception while metadata fetch failed
    
    (cherry picked from commit 9bdf5da59036c0b052df756fc4a28d64677072e7)
    Signed-off-by: Andrew Or <[email protected]>

commit a48ea3cef22687694a4471705fb707bd1e8fa592
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-26T00:07:09Z

    [Spark-4509] Revert EC2 tag-based cluster membership patch
    
    This PR reverts changes related to tag-based cluster membership. As 
discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to 
determine cluster membership, because tagging is not atomic. The following 
changes are reverted:
    
    SPARK-2333: 94053a7b766788bb62e2dbbf352ccbcc75f71fc0
    SPARK-3213: 7faf755ae4f0cf510048e432340260a6e609066d
    SPARK-3608: 78d4220fa0bf2f9ee663e34bbf3544a5313b02f0.
    
    I tested launch, login, and destroy. It is easy to check the diff by 
comparing it to Josh's patch for branch-1.1:
    
    https://github.com/apache/spark/pull/2225/files
    
    JoshRosen I sent the PR to master. It might be easier for us to keep master 
and branch-1.2 the same at this time. We can always re-apply the patch once we 
figure out a stable solution.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:
    
    f0b708b [Xiangrui Meng] revert 94053a7b766788bb62e2dbbf352ccbcc75f71fc0
    4298ea5 [Xiangrui Meng] revert 7faf755ae4f0cf510048e432340260a6e609066d
    35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming 
succeeds"
    
    (cherry picked from commit 7eba0fbe456c451122d7a2353ff0beca00f15223)
    Signed-off-by: Andrew Or <[email protected]>

commit 6880b467f66a4906161cbc343e70d975056a4f5f
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-26T04:10:15Z

    [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
    
    Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
    * the gradient (and therefore loss) does not match that used by Friedman 
(1999)
    * the error computation uses 0/1 accuracy, not log loss
    
    This PR updates LogLoss.
    It also adds some doc for boosting and forests.
    
    I tested it on sample data and made sure the log loss is monotonically 
decreasing with each boosting iteration.
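
For readers unfamiliar with the loss in question, here is a minimal sketch. It assumes the binomial log-likelihood form usually attributed to Friedman, 2 log(1 + exp(-2 y F)) with labels y in {-1, +1}. The function names are illustrative and the code is not taken from the patch.

```python
import math


def log_loss(label, prediction):
    """Assumed loss form: 2 * log(1 + exp(-2 * label * prediction))."""
    margin = 2.0 * label * prediction
    # Branch on the sign so that exp() never sees a large positive argument.
    if margin >= 0:
        return 2.0 * math.log1p(math.exp(-margin))
    return 2.0 * (-margin + math.log1p(math.exp(margin)))


def log_loss_gradient(label, prediction):
    """Derivative of the loss above with respect to the prediction:
    -4 * label / (1 + exp(2 * label * prediction))."""
    margin = 2.0 * label * prediction
    if margin >= 0:
        e = math.exp(-margin)
        return -4.0 * label * e / (1.0 + e)
    return -4.0 * label / (1.0 + math.exp(margin))
```

Reporting the mean of `log_loss` over a dataset, rather than 0/1 accuracy, is what makes it meaningful to check that the error decreases with each boosting iteration.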
    
    CC: mengxr manishamde codedeft
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:
    
    cfec17e [Joseph K. Bradley] removed forgotten temp comments
    a27eb6d [Joseph K. Bradley] corrections to last log loss commit
    ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical 
stability
    5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also 
required updating the test suite since it effectively doubles the gradient and 
loss. * Added doc for developers within RandomForest. * Small cleanup in test 
suite (generating data only once)
    e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and 
updated doc for losses, forests, and boosting
    
    (cherry picked from commit c251fd7405db57d3ab2686c38712601fd8f13ccd)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 37d58aaac20b9ab34ea50c9e62905c7f80fe5036
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T04:10:19Z

    [HOTFIX]: Adding back without-hive dist

commit 2756d0de91d996f80c0b0883cad1d2fab336ed84
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-26T04:11:40Z

    [SPARK-4604][MLLIB] make MatrixFactorizationModel public
    
    Users can now construct an MF model directly. I added a note about the 
performance.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #3459 from mengxr/SPARK-4604 and squashes the following commits:
    
    f64bcd3 [Xiangrui Meng] organize imports
    ed08214 [Xiangrui Meng] check preconditions and unit tests
    a624c12 [Xiangrui Meng] make MatrixFactorizationModel public
    
    (cherry picked from commit b5fb1410c5eed1156decb4f9fcc22436a658ce4d)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 1e12f594be277f6b390c998b1a1e5581ecebdcb0
Author: Aaron Davidson <[email protected]>
Date:   2014-11-26T04:57:04Z

    [SPARK-4516] Cap default number of Netty threads at 8
    
    In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, 
and each core that we use will have an initial overhead of roughly 32 MB of 
off-heap memory, which comes at a premium.
    
    Thus, this value should still retain maximum throughput and reduce wasted 
off-heap memory allocation. It can be overridden by setting the number of 
serverThreads and clientThreads manually in Spark's configuration.
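
The capping rule can be sketched as follows. The helper names and the `configured` parameter are hypothetical; the cap of 8 and the figure of roughly 32 MB per thread come from the description above.

```python
import os

MAX_DEFAULT_THREADS = 8      # cap described in the commit message
OVERHEAD_MB_PER_THREAD = 32  # rough per-thread off-heap overhead quoted above


def default_num_threads(configured=0, available_cores=None):
    """Return the thread count to use (hypothetical helper).

    A positive `configured` value models an explicit override; otherwise
    the default is the number of cores, capped at MAX_DEFAULT_THREADS.
    """
    if configured > 0:
        return configured
    if available_cores is None:
        available_cores = os.cpu_count() or 1
    return min(available_cores, MAX_DEFAULT_THREADS)


def estimated_overhead_mb(num_threads):
    """Rough initial off-heap overhead for the given number of threads."""
    return num_threads * OVERHEAD_MB_PER_THREAD
```

On a 32-core machine the uncapped default would reserve about four times as much initial off-heap memory as the capped one, which is the waste the change targets.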
    
    Author: Aaron Davidson <[email protected]>
    
    Closes #3469 from aarondav/fewer-pools2 and squashes the following commits:
    
    087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads 
at 8
    
    (cherry picked from commit f5f2d27385c243959f03a9d78a149d5f405b2f50)
    Signed-off-by: Patrick Wendell <[email protected]>

commit b028aaff161ad749e4723f5821ed000320a6665e
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:06:14Z

    Revert "Preparing development version 1.2.1-SNAPSHOT"
    
    This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8.

commit 01271786e67bdf8441824fb4dd9ed6e9fd95eaaa
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:06:16Z

    Revert "Preparing Spark release v1.2.0-snapshot1"
    
    This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b.

commit db7f4a898af22a02b36428507f8ef2b429d78dc1
Author: Ubuntu <[email protected]>
Date:   2014-11-26T05:07:50Z

    Preparing Spark release v1.2.0-rc1

commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b
Author: Ubuntu <[email protected]>
Date:   2014-11-26T05:07:50Z

    Preparing development version 1.2.1-SNAPSHOT

commit 68a217cd1a792ca3486442e9aa63ca0258e88762
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:08:57Z

    Revert "Preparing development version 1.2.1-SNAPSHOT"
    
    This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b.

commit ce6200b265e63979483e0cccecff391faa159903
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:09:01Z

    Revert "Preparing Spark release v1.2.0-rc1"
    
    This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1.

commit 5247dd859b95a440baa562b9827bdeb26aa6530e
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:10:29Z

    Preparing Spark release v1.2.0-rc1

commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:10:29Z

    Preparing development version 1.2.1-SNAPSHOT

commit 37bc7a830e862d47776b85767ba599d61ef13e01
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:11:49Z

    Revert "Preparing development version 1.2.1-SNAPSHOT"
    
    This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f.

commit de8029b39142be5e91714a9d5240bcdb90f66886
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:11:58Z

    Revert "Preparing Spark release v1.2.0-rc1"
    
    This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e.

commit dfb8c65b730fdf60540e91cd74fbaa2764a2a2bc
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:16:20Z

    HOTFIX: Updating additional version data

commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:17:08Z

    Preparing Spark release v1.2.0-rc1

commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:17:09Z

    Preparing development version 1.2.1-SNAPSHOT

commit c7185f0c08e2a42e2595466e2d8ac394cbf66f5b
Author: Aaron Davidson <[email protected]>
Date:   2014-11-26T05:32:45Z

    [SPARK-4516] Avoid allocating Netty PooledByteBufAllocators unnecessarily
    
    Turns out we are allocating an allocator pool for every TransportClient 
(which means that the number increases with the number of nodes in the 
cluster), when really we should just reuse one for all clients.
    
    This patch, as expected, greatly decreases off-heap memory allocation, and 
appears to make allocation only proportional to the number of cores.
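
The difference between one allocator per client and one shared allocator can be illustrated with a toy model. All class and function names here are hypothetical stand-ins, not Netty's or Spark's API.

```python
class PooledAllocator:
    """Stand-in for an expensive pooled buffer allocator."""

    instances = 0

    def __init__(self):
        PooledAllocator.instances += 1


class PerClientFactory:
    """Old behaviour: every client gets its own allocator."""

    def create_client(self, host):
        return {"host": host, "allocator": PooledAllocator()}


class SharedFactory:
    """New behaviour: one allocator is created once and reused."""

    def __init__(self):
        self.allocator = PooledAllocator()

    def create_client(self, host):
        return {"host": host, "allocator": self.allocator}


def allocators_needed(factory_cls, num_hosts):
    """Count allocators created when connecting to `num_hosts` nodes."""
    PooledAllocator.instances = 0
    factory = factory_cls()
    for i in range(num_hosts):
        factory.create_client("node-%d" % i)
    return PooledAllocator.instances
```

In this model the per-client factory creates as many allocators as there are nodes, while the shared factory creates exactly one regardless of cluster size.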
    
    Author: Aaron Davidson <[email protected]>
    
    Closes #3465 from aarondav/fewer-pools and squashes the following commits:
    
    36c49da [Aaron Davidson] [SPARK-4516] Avoid allocating unnecessarily Netty 
PooledByteBufAllocators
    
    (cherry picked from commit 346bc17a2ec8fc9e6eaff90733aa1e8b6b46883e)
    Signed-off-by: Patrick Wendell <[email protected]>

commit 537d699a53b1fe227d570635e3b4a33abf2d72ab
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:36:35Z

    Revert "Preparing development version 1.2.1-SNAPSHOT"
    
    This reverts commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa.

commit 8f5ebcb63c28254abf60cce87c3706ccdee3c91a
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:36:43Z

    Revert "Preparing Spark release v1.2.0-rc1"
    
    This reverts commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2.

commit 17a4b8e597391af3a258f8f4f9c910e341ba39c3
Author: Patrick Wendell <[email protected]>
Date:   2014-11-26T05:42:01Z

    Revert "[SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc 
updates"
    
    This reverts commit 6880b467f66a4906161cbc343e70d975056a4f5f.

commit 69d021b0becdffe225a1c8859d8c6adeb1a94f4a
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-26T06:29:56Z

    Revert "[SPARK-4604][MLLIB] make MatrixFactorizationModel public"
    
    This reverts commit 2756d0de91d996f80c0b0883cad1d2fab336ed84.

commit e8669729af4b49423a7514830436b2cb4ee6a08a
Author: Tathagata Das <[email protected]>
Date:   2014-11-26T07:15:58Z

    [SPARK-4612] Reduce task latency and increase scheduling throughput by 
making configuration initialization lazy
    
    
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337
 creates a configuration object for every task that is launched, even if there 
is no new dependent file/JAR to update. This is a heavy-weight creation that 
should be avoided if there is no new file/JAR to update. This PR makes that 
creation lazy. A quick local run of the spark-perf scheduling throughput tests 
gives the following numbers in local standalone scheduler mode.
    1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds, 
roughly a 3x increase in task scheduling throughput
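
The idea of creating the configuration lazily can be sketched as below. This is a simplified model with hypothetical names (`ToyExecutor`, `update_dependencies`, `_make_conf`), not the code in Executor.scala.

```python
class ToyExecutor:
    """Toy model of per-task dependency updates (not Spark's Executor)."""

    def __init__(self):
        self.current_files = {}   # file name -> timestamp already fetched
        self.configs_created = 0  # counts the expensive constructions

    def _make_conf(self):
        # Stands in for the heavy-weight configuration object.
        self.configs_created += 1
        return {"created": True}

    def update_dependencies(self, new_files):
        conf = None  # built only if some file actually needs fetching
        for name, timestamp in new_files.items():
            if self.current_files.get(name, -1) < timestamp:
                if conf is None:
                    conf = self._make_conf()
                # A real implementation would fetch the file using `conf`.
                self.current_files[name] = timestamp


def run_tasks(num_tasks):
    """Launch `num_tasks` tasks that all depend on the same file."""
    executor = ToyExecutor()
    for _ in range(num_tasks):
        executor.update_dependencies({"app.jar": 1})
    return executor.configs_created
```

With eager creation the configuration would be built once per task; in this sketch `run_tasks` builds it only for the first task, because later tasks find no new file or JAR to fetch.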
    
    pwendell JoshRosen
    
    Author: Tathagata Das <[email protected]>
    
    Closes #3463 from tdas/lazy-config and squashes the following commits:
    
    c791c1e [Tathagata Das] Reduce task latency by making configuration 
initialization lazy
    
    (cherry picked from commit e7f4d2534bb3361ec4b7af0d42bc798a7a425226)
    Signed-off-by: Reynold Xin <[email protected]>

commit 6b5564ab5e63541cbe11e0c13db645b140b68301
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-26T04:10:15Z

    [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
    
    Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
    * the gradient (and therefore loss) does not match that used by Friedman 
(1999)
    * the error computation uses 0/1 accuracy, not log loss
    
    This PR updates LogLoss.
    It also adds some doc for boosting and forests.
    
    I tested it on sample data and made sure the log loss is monotonically 
decreasing with each boosting iteration.
    
    CC: mengxr manishamde codedeft
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:
    
    cfec17e [Joseph K. Bradley] removed forgotten temp comments
    a27eb6d [Joseph K. Bradley] corrections to last log loss commit
    ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical 
stability
    5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also 
required updating the test suite since it effectively doubles the gradient and 
loss. * Added doc for developers within RandomForest. * Small cleanup in test 
suite (generating data only once)
    e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and 
updated doc for losses, forests, and boosting
    
    (cherry picked from commit c251fd7405db57d3ab2686c38712601fd8f13ccd)
    Signed-off-by: Xiangrui Meng <[email protected]>

----


