GitHub user willb reopened a pull request:

    https://github.com/apache/spark/pull/143

    SPARK-897:  preemptively serialize closures

    These commits cause `ClosureCleaner.clean` to attempt to serialize the 
cleaned closure with the default closure serializer and throw a 
`SparkException` if doing so fails.  This behavior is enabled by default but 
can be disabled at individual callsites of `SparkContext.clean`.
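
    A minimal Scala sketch of the idea behind the check (using plain Java serialization rather than Spark's configured closure serializer, with an illustrative helper name):
    
    ```
    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
    
    object ClosureCheck {
      // Try to serialize the closure up front so a capture of non-serializable
      // state fails at the call site instead of at task launch time.
      def ensureSerializable(closure: AnyRef): Unit = {
        val out = new ObjectOutputStream(new ByteArrayOutputStream())
        try out.writeObject(closure)
        catch {
          case e: NotSerializableException =>
            throw new RuntimeException(s"Closure is not serializable: ${e.getMessage}", e)
        } finally out.close()
      }
    }
    ```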
    
    Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work 
exposed; I'm happy to put that in a separate PR if that would be more 
appropriate.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-897

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143
    
----
commit 5cd11d51c19321981a6234a7765c7a5be6913433
Author: Ivan Wick <[email protected]>
Date:   2014-04-11T00:49:30Z

    Set spark.executor.uri from environment variable (needed by Mesos)
    
    The Mesos backend uses this property when setting up a slave process.  It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find any analogue for pyspark.
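
    A rough Scala equivalent of what the Scala repl does for this property (the environment variable name follows SparkILoop; the object and method names here are illustrative):
    
    ```
    import org.apache.spark.SparkConf
    
    object ExecutorUriFromEnv {
      // Propagate the executor URI from the environment so the Mesos backend
      // knows where to fetch the Spark distribution from.
      def configure(conf: SparkConf): SparkConf = {
        sys.env.get("SPARK_EXECUTOR_URI").foreach(uri => conf.set("spark.executor.uri", uri))
        conf
      }
    }
    ```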
    
    Author: Ivan Wick <[email protected]>
    
    This patch had conflicts when merged, resolved by
    Committer: Matei Zaharia <[email protected]>
    
    Closes #311 from ivanwick/master and squashes the following commits:
    
    da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable 
(needed by Mesos)

commit 7b4203ab4c640f7875ae3536228ed4d791062017
Author: Harvey Feng <[email protected]>
Date:   2014-04-11T01:25:54Z

    Add Spark v0.9.1 to ec2 launch script and use it as the default
    
    Mainly ported from branch-0.9.
    
    Author: Harvey Feng <[email protected]>
    
    Closes #385 from harveyfeng/0.9.1-ec2 and squashes the following commits:
    
    769ac2f [Harvey Feng] Add Spark v0.9.1 to ec2 launch script and use it as 
the default

commit 44f654eecd3c181f2aeaff3871acf7f00eacc6b9
Author: Patrick Wendell <[email protected]>
Date:   2014-04-11T03:43:56Z

    SPARK-1202: Improvements to task killing in the UI.
    
    1. Adds a separate endpoint for the killing logic that is outside of a page.
    2. Narrows the scope of the killingEnabled tracking.
    3. Some style improvements.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #386 from pwendell/kill-link and squashes the following commits:
    
    8efe02b [Patrick Wendell] Improvements to task killing in the UI.

commit 446bb3417a2855a194d49acc0ac316a021eced9d
Author: Thomas Graves <[email protected]>
Date:   2014-04-11T07:47:48Z

    SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken
    
    Author: Thomas Graves <[email protected]>
    
    Closes #344 from tgravescs/SPARK-1417 and squashes the following commits:
    
    c450b5f [Thomas Graves] fix test
    e1c1d7e [Thomas Graves] add missing $ to appUIAddress
    e982ddb [Thomas Graves] use appUIHostPort in appUIAddress
    0803ec2 [Thomas Graves] Review comment updates - remove extra newline, 
simplify assert in test
    658a8ec [Thomas Graves] Add a appUIHostPort routine
    0614208 [Thomas Graves] Fix test
    2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from 
resourcemanager is broken

commit 98225a6effd077a1b97c7e485d45ffd89b2c5b7f
Author: Patrick Wendell <[email protected]>
Date:   2014-04-11T17:45:27Z

    Some clean up in build/docs
    
    (a) Deleted an outdated line from the docs
    (b) Removed a workaround that is no longer necessary given the Mesos version bump.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #382 from pwendell/maven-clean and squashes the following commits:
    
    f0447fa [Patrick Wendell] Minor doc clean-up

commit f5ace8da34c58d1005c7c377cfe3df21102c1dd6
Author: Xiangrui Meng <[email protected]>
Date:   2014-04-11T19:06:13Z

    [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and 
BinaryClassificationMetrics
    
    This PR implements a generic version of `AreaUnderCurve` using the 
`RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It 
also contains refactoring of https://github.com/apache/spark/pull/160 for 
binary classification evaluation.
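
    The core computation that the sliding approach enables is just the trapezoid rule over consecutive curve points; a local (non-RDD) Scala sketch with an illustrative function name:
    
    ```
    // Area under a curve given as (x, y) points sorted by x, via the trapezoid rule.
    def areaUnderCurve(curve: Seq[(Double, Double)]): Double =
      curve.sliding(2).collect { case Seq((x1, y1), (x2, y2)) =>
        (y1 + y2) / 2.0 * (x2 - x1)
      }.sum
    
    // areaUnderCurve(Seq((0.0, 0.0), (0.5, 0.8), (1.0, 1.0)))  // => 0.65
    ```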
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #364 from mengxr/auc and squashes the following commits:
    
    a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
    3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
    fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
    b1b7dab [Xiangrui Meng] fix code styles
    9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
    ca31da5 [Xiangrui Meng] remove PredictionAndResponse
    3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
    8f78958 [Xiangrui Meng] add PredictionAndResponse
    dda82d5 [Xiangrui Meng] add confusion matrix
    aa7e278 [Xiangrui Meng] add initial version of binary classification 
evaluator
    221ebce [Xiangrui Meng] add a new test to sliding
    a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
    a9b250a [Xiangrui Meng] move sliding to mllib
    cab9a52 [Xiangrui Meng] use last for the last element
    db6cb30 [Xiangrui Meng] remove unnecessary toSeq
    9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
    284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
    c1c6c22 [Xiangrui Meng] add AreaUnderCurve
    65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
    5ee6001 [Xiangrui Meng] add TODO
    d2a600d [Xiangrui Meng] add sliding to rdd

commit 6a0f8e35ce7595c4ece11fe04133fd44ffbe5b06
Author: Patrick Wendell <[email protected]>
Date:   2014-04-11T20:23:21Z

    HOTFIX: Ignore python metastore files in RAT checks.
    
    This was causing some errors with pull request tests.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #393 from pwendell/hotfix and squashes the following commits:
    
    6201dd3 [Patrick Wendell] HOTFIX: Ignore python metastore files in RAT 
checks.

commit 7038b00be9c84a4d92f9d95ff3d75fae47d57d87
Author: Xiangrui Meng <[email protected]>
Date:   2014-04-12T02:41:40Z

    [FIX] make coalesce test deterministic in RDDSuite
    
    Make coalesce test deterministic by setting pre-defined seeds. (Saw random 
failures in other PRs.)
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #387 from mengxr/fix-random and squashes the following commits:
    
    59bc16f [Xiangrui Meng] make coalesce test deterministic in RDDSuite

commit fdfb45e691946f3153d6c696bec6d7f3e391e301
Author: Xusen Yin <[email protected]>
Date:   2014-04-12T02:43:22Z

    [WIP] [SPARK-1328] Add vector statistics
    
    With the new vector system in MLlib, it is useful to add some new APIs to process an `RDD[Vector]`. Besides, the former implementation of `computeStat` is not numerically stable: it can lose precision and may produce `NaN` in scientific computing, as described in [SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
    
    APIs contain:
    
    * rowMeans(): RDD[Double]
    * rowNorm2(): RDD[Double]
    * rowSDs(): RDD[Double]
    * colMeans(): Vector
    * colMeans(size: Int): Vector
    * colNorm2(): Vector
    * colNorm2(size: Int): Vector
    * colSDs(): Vector
    * colSDs(size: Int): Vector
    * maxOption((Vector, Vector) => Boolean): Option[Vector]
    * minOption((Vector, Vector) => Boolean): Option[Vector]
    * rowShrink(): RDD[Vector]
    * colShrink(): RDD[Vector]
    
    This is a work in progress; some more APIs will be added for `LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to `MLContext` later.
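
    For the column-wise statistics, the single-pass seqop/combop shape mentioned in the commits looks roughly like the following local Scala sketch (illustrative names, not the MLlib implementation):
    
    ```
    // Column means over equally sized rows, computed in one aggregate pass:
    // the seqop folds a row into running sums, the combop merges partial results.
    def colMeans(rows: Seq[Array[Double]], size: Int): Array[Double] = {
      val (sums, count) = rows.aggregate((new Array[Double](size), 0L))(
        { case ((acc, n), row) =>
          var i = 0
          while (i < size) { acc(i) += row(i); i += 1 }
          (acc, n + 1)
        },
        { case ((a, n1), (b, n2)) =>
          var i = 0
          while (i < size) { a(i) += b(i); i += 1 }
          (a, n1 + n2)
        })
      sums.map(_ / count)
    }
    ```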
    
    Author: Xusen Yin <[email protected]>
    Author: Xiangrui Meng <[email protected]>
    
    Closes #268 from yinxusen/vector-statistics and squashes the following 
commits:
    
    d61363f [Xusen Yin] rebase to latest master
    16ae684 [Xusen Yin] fix minor error and remove useless method
    10cf5d3 [Xusen Yin] refine some return type
    b064714 [Xusen Yin] remove computeStat in MLUtils
    cbbefdb [Xiangrui Meng] update multivariate statistical summary interface 
and clean tests
    4eaf28a [Xusen Yin] merge VectorRDDStatistics into RowMatrix
    48ee053 [Xusen Yin] fix minor error
    e624f93 [Xusen Yin] fix scala style error
    1fba230 [Xusen Yin] merge while loop together
    69e1f37 [Xusen Yin] remove lazy eval, and minor memory footprint
    548e9de [Xusen Yin] minor revision
    86522c4 [Xusen Yin] add comments on functions
    dc77e38 [Xusen Yin] test sparse vector RDD
    18cf072 [Xusen Yin] change def to lazy val to make sure that the 
computations in function be evaluated only once
    f7a3ca2 [Xusen Yin] fix the corner case of maxmin
    967d041 [Xusen Yin] full revision with Aggregator class
    138300c [Xusen Yin] add new Aggregator class
    1376ff4 [Xusen Yin] rename variables and adjust code
    4a5c38d [Xusen Yin] add scala doc, refine code and comments
    036b7a5 [Xusen Yin] fix the bug of Nan occur
    f6e8e9a [Xusen Yin] add sparse vectors test
    4cfbadf [Xusen Yin] fix bug of min max
    4e4fbd1 [Xusen Yin] separate seqop and combop out as independent functions
    a6d5a2e [Xusen Yin] rewrite for only computing non-zero elements
    3980287 [Xusen Yin] rename variables
    62a2c3e [Xusen Yin] use axpy and in-place if possible
    9a75ebd [Xusen Yin] add case class to wrap return values
    d816ac7 [Xusen Yin] remove useless APIs
    c4651bb [Xusen Yin] remove row-wise APIs and refine code
    1338ea1 [Xusen Yin] all-in-one version test passed
    cc65810 [Xusen Yin] add parallel mean and variance
    9af2e95 [Xusen Yin] refine the code style
    ad6c82d [Xusen Yin] add shrink test
    e09d5d2 [Xusen Yin] add scala docs and refine shrink method
    8ef3377 [Xusen Yin] pass all tests
    28cf060 [Xusen Yin] fix error of column means
    54b19ab [Xusen Yin] add new API to shrink RDD[Vector]
    8c6c0e1 [Xusen Yin] add basic statistics

commit aa8bb117a3ff98420ab751ba4ddbaad88ab57f9d
Author: baishuo(白硕) <[email protected]>
Date:   2014-04-12T03:33:42Z

    Update WindowedDStream.scala
    
    Update the content of the exception thrown when windowDuration is not a multiple of parent.slideDuration.
    
    Author: baishuo(白硕) <[email protected]>
    
    Closes #390 from baishuo/windowdstream and squashes the following commits:
    
    533c968 [baishuo(白硕)] Update WindowedDStream.scala

commit 165e06a74c3d75e6b7341c120943add8b035b96a
Author: Sean Owen <[email protected]>
Date:   2014-04-12T05:46:47Z

    SPARK-1057 (alternative) Remove fastutil
    
    (This is for discussion at this point -- I'm not suggesting this should be 
committed.)
    
    This is what removing fastutil looks like. Much of it is straightforward, 
like using `java.io` buffered stream classes, and Guava for murmurhash3.
    
    Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case, though, do I think the change to use `java.io` actually entails an extra array copy.
    
    The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more Scala-like operations.
    
    `OpenHashMap` is where I made three non-trivial changes to make it work, 
and they need review:
    
    - It is no longer private
    - The key must be a `ClassTag`
    - Unless a lot of other code changes, the key type can't enforce being a 
supertype of `Null`
    
    It all works and tests pass, and I think there is reason to believe it's OK 
from a speed perspective.
    
    But what about those last changes?
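
    On the `ClassTag` point: the bound exists so the map can allocate a typed key array internally. A toy Scala illustration (not Spark's `OpenHashMap`):
    
    ```
    import scala.reflect.ClassTag
    
    // `new Array[K](n)` needs a ClassTag[K] in scope, hence the context bound.
    class TypedSlots[K: ClassTag](n: Int) {
      val keys: Array[K] = new Array[K](n)
    }
    
    object TypedSlotsDemo {
      // The tag is supplied implicitly at the call site for concrete key types.
      val intSlots    = new TypedSlots[Int](8)     // backed by an int[]
      val stringSlots = new TypedSlots[String](8)  // backed by a String[]
    }
    ```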
    
    Author: Sean Owen <[email protected]>
    
    Closes #266 from srowen/SPARK-1057-alternate and squashes the following 
commits:
    
    2601129 [Sean Owen] Fix Map return type error not previously caught
    ec65502 [Sean Owen] Updates from matei's review
    00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, 
spark.util and Guava classes

commit 6aa08c39cf30fa5c4ed97f4fff16371b9030a2e6
Author: Tathagata Das <[email protected]>
Date:   2014-04-12T06:33:49Z

    [SPARK-1386] Web UI for Spark Streaming
    
    When debugging Spark Streaming applications it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is the average processing time of batches? What is the scheduling delay? Is the system able to process data as fast as it is receiving it? How many records am I receiving through my receivers?
    
    While the StreamingListener interface introduced in 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to streaming applications makes debugging easier. This PR introduces such a UI. It shows various statistics related to the streaming application. Here is a screenshot of the UI running on my local machine.
    
    http://i.imgur.com/1ooDGhm.png
    
    This UI is integrated into the Spark UI running at 4040.
    
    Author: Tathagata Das <[email protected]>
    Author: Andrew Or <[email protected]>
    
    Closes #290 from tdas/streaming-web-ui and squashes the following commits:
    
    fc73ca5 [Tathagata Das] Merge pull request #9 from andrewor14/ui-refactor
    642dd88 [Andrew Or] Merge SparkUISuite.scala into UISuite.scala
    eb30517 [Andrew Or] Merge github.com:apache/spark into ui-refactor
    f4f4cbe [Tathagata Das] More minor fixes.
    34bb364 [Tathagata Das] Merge branch 'streaming-web-ui' of 
github.com:tdas/spark into streaming-web-ui
    252c566 [Tathagata Das] Merge pull request #8 from andrewor14/ui-refactor
    e038b4b [Tathagata Das] Addressed Patrick's comments.
    125a054 [Andrew Or] Disable serving static resources with gzip
    90feb8d [Andrew Or] Address Patrick's comments
    89dae36 [Tathagata Das] Merge branch 'streaming-web-ui' of 
github.com:tdas/spark into streaming-web-ui
    72fe256 [Tathagata Das] Merge pull request #6 from andrewor14/ui-refactor
    2fc09c8 [Tathagata Das] Added binary check exclusions
    aa396d4 [Andrew Or] Rename tabs and pages (No more IndexPage.scala)
    f8e1053 [Tathagata Das] Added Spark and Streaming UI unit tests.
    caa5e05 [Tathagata Das] Merge branch 'streaming-web-ui' of 
github.com:tdas/spark into streaming-web-ui
    585cd65 [Tathagata Das] Merge pull request #5 from andrewor14/ui-refactor
    914b8ff [Tathagata Das] Moved utils functions to UIUtils.
    548c98c [Andrew Or] Wide refactoring of WebUI, UITab, and UIPage (see 
commit message)
    6de06b0 [Tathagata Das] Merge remote-tracking branch 'apache/master' into 
streaming-web-ui
    ee6543f [Tathagata Das] Minor changes based on Andrew's comments.
    fa760fe [Tathagata Das] Fixed long line.
    1c0bcef [Tathagata Das] Refactored streaming UI into two files.
    1af239b [Tathagata Das] Changed streaming UI to attach itself as a tab with 
the Spark UI.
    827e81a [Tathagata Das] Merge branch 'streaming-web-ui' of 
github.com:tdas/spark into streaming-web-ui
    168fe86 [Tathagata Das] Merge pull request #2 from andrewor14/ui-refactor
    3e986f8 [Tathagata Das] Merge remote-tracking branch 'apache/master' into 
streaming-web-ui
    c78c92d [Andrew Or] Remove outdated comment
    8f7323b [Andrew Or] End of file new lines, indentation, and imports (minor)
    0d61ee8 [Andrew Or] Merge branch 'streaming-web-ui' of 
github.com:tdas/spark into ui-refactor
    9a48fa1 [Andrew Or] Allow adding tabs to SparkUI dynamically + add example
    61358e3 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' 
into streaming-web-ui
    53be2c5 [Tathagata Das] Minor style updates.
    ed25dfc [Andrew Or] Generalize SparkUI header to display tabs dynamically
    a37ad4f [Andrew Or] Comments, imports and formatting (minor)
    cd000b0 [Andrew Or] Merge github.com:apache/spark into ui-refactor
    7d57444 [Andrew Or] Refactoring the UI interface to add flexibility
    aef4dd5 [Tathagata Das] Added Apache licenses.
    db27bad [Tathagata Das] Added last batch processing time to StreamingUI.
    4d86e98 [Tathagata Das] Added basic stats to the StreamingUI and refactored 
the UI to a Page to make it easier to transition to using SparkUI later.
    93f1c69 [Tathagata Das] Added network receiver information to the Streaming 
UI.
    56cc7fb [Tathagata Das] First cut implementation of Streaming UI.

commit c2d160fbee2ef90a7683d9771f2f632b68d74aef
Author: Andrew Or <[email protected]>
Date:   2014-04-12T23:33:38Z

    [Fix #204] Update out-dated comments
    
    This PR is self-explanatory.
    
    Author: Andrew Or <[email protected]>
    
    Closes #381 from andrewor14/master and squashes the following commits:
    
    3e8dde2 [Andrew Or] Fix comments for #204

commit ca11919e6e97a62eb3e3ce882ffa29eae36f50f7
Author: Bharath Bhushan <[email protected]>
Date:   2014-04-13T03:52:29Z

    [SPARK-1403] Move the class loader creation back to where it was in 0.9.0
    
    [SPARK-1403] I investigated why Spark 0.9.0 loads fine on Mesos while Spark 1.0.0 fails. What I found was that in SparkEnv.scala, while creating the SparkEnv object, the current thread's classloader is null. But in 0.9.0, at the same place, it is set to org.apache.spark.repl.ExecutorClassLoader. I saw that https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01 moved it to its current place. I moved it back and saw that 1.0.0 started working fine on Mesos.
    
    I just created a minimal patch that allows me to run Spark on Mesos correctly. It seems like SecurityManager's creation needs to be taken into account for a correct fix. Also, moving the creation of the serializer out of SparkEnv might be part of the right solution. PTAL.
    
    Author: Bharath Bhushan <[email protected]>
    
    Closes #322 from manku-timma/spark-1403 and squashes the following commits:
    
    606c2b9 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    ec8f870 [Bharath Bhushan] revert the logger change for java 6 compatibility 
as PR 334 is doing it
    728beca [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    044027d [Bharath Bhushan] fix compile error
    6f260a4 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    b3a053f [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    04b9662 [Bharath Bhushan] add missing line
    4803c19 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    f3c9a14 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' 
into spark-1403
    42d3d6a [Bharath Bhushan] used code fragment from @ueshin to fix the 
problem in a better way
    89109d7 [Bharath Bhushan] move the class loader creation back to where it 
was in 0.9.0

commit 4bc07eebbf5e2ea0c0b6f1642049515025d88d07
Author: Patrick Wendell <[email protected]>
Date:   2014-04-13T15:58:37Z

    SPARK-1480: Clean up use of classloaders
    
    The Spark codebase is a bit fast-and-loose when accessing classloaders and 
this has caused a few bugs to surface in master.
    
    This patch defines some utility methods for accessing classloaders. This 
makes the intention when accessing a classloader much more explicit in the code 
and fixes a few cases where the wrong one was chosen.
    
    case (a) -> We want the classloader that loaded Spark
    case (b) -> We want the context class loader, or if not present, we want (a)
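
    A minimal Scala sketch of the two cases as utility methods (names are illustrative, not necessarily the patch's exact helpers):
    
    ```
    object ClassLoaders {
      // Case (a): the classloader that loaded Spark (here, the one that loaded this object).
      def sparkClassLoader: ClassLoader = getClass.getClassLoader
    
      // Case (b): the thread's context classloader if set, otherwise fall back to (a).
      def contextOrSparkClassLoader: ClassLoader =
        Option(Thread.currentThread().getContextClassLoader).getOrElse(sparkClassLoader)
    }
    ```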
    
    This patch provides a better fix for SPARK-1403 
(https://issues.apache.org/jira/browse/SPARK-1403) than the current work 
around, which it reverts. It also fixes a previously unreported bug that the 
`./spark-submit` script did not work for running with `local` master. It didn't 
work because the executor classloader did not properly delegate to the context 
class loader (if it is defined) and in local mode the context class loader is 
set by the `./spark-submit` script. A unit test is added for that case.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #398 from pwendell/class-loaders and squashes the following commits:
    
    b4a1a58 [Patrick Wendell] Minor clean up
    14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders

commit 037fe4d2ba01be5610baa3dd9c5c9d3a5e5e1064
Author: Xusen Yin <[email protected]>
Date:   2014-04-13T20:18:52Z

    [SPARK-1415] Hadoop min split for wholeTextFiles()
    
    JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-1415).
    
    The new Hadoop `InputFormat` API does not provide the `minSplits` parameter, which makes the API incompatible between `HadoopRDD` and `NewHadoopRDD`. This PR constructs compatible APIs.
    
    Though `minSplits` is deprecated in the new Hadoop API, we think it is better to keep the APIs compatible here.
    
    **Note** that `minSplits` in `wholeTextFiles` can only be treated as a *suggestion*; the real number of splits may not be greater than `minSplits` because `isSplitable()` returns `false`.
    
    Author: Xusen Yin <[email protected]>
    
    Closes #376 from yinxusen/hadoop-min-split and squashes the following 
commits:
    
    76417f6 [Xusen Yin] refine comments
    c10af60 [Xusen Yin] refine comments and rewrite new class for wholeTextFile
    766d05b [Xusen Yin] refine Java API and comments
    4875755 [Xusen Yin] add minSplits for WholeTextFiles

commit 7dbca68e92416ec5f023c8807bb06470c01a6d3a
Author: Cheng Lian <[email protected]>
Date:   2014-04-14T22:22:43Z

    [BUGFIX] In-memory columnar storage bug fixes
    
    Fixed several bugs of in-memory columnar storage to make 
`HiveInMemoryCompatibilitySuite` pass.
    
    @rxin @marmbrus It is reasonable to include 
`HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it 
significantly increases test execution time. What do you think?
    
    **UPDATE** `HiveCompatibilitySuite` has been made to cache tables in 
memory. `HiveInMemoryCompatibilitySuite` was removed.
    
    Author: Cheng Lian <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes #374 from liancheng/inMemBugFix and squashes the following commits:
    
    6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and 
HiveInMemoryCompatibilitySuite
    5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
    882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
    32cc9ce [Cheng Lian] Code style cleanup
    99382bf [Cheng Lian] Enable compression by default
    4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
    d1df4fd [Michael Armbrust] Remove test tables that might always get created 
anyway?
    ab9e807 [Michael Armbrust] Fix the logged console version of failed test 
cases to use the new syntax.
    1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a 
single partition, as it does not work correctly with mutable rows.
    e36cdd0 [Michael Armbrust] Spelling.
    2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
    6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and 
InMemoryColumnarTableScan
    c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
    9c8fc40 [Cheng Lian] Disable compression by default
    e619995 [Cheng Lian] Bug fix: incorrect byte order in 
CompressionScheme.columnHeaderSize
    8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache 
columns specified by the attributes argument
    036cd09 [Cheng Lian] Clean up unused imports
    44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take 
nulls into account
    052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for 
non-null values
    95b3301 [Cheng Lian] Fixed bugs in IntegralDelta

commit 268b53567c93538c03cb66276ed9e05c9f1d3ac6
Author: Patrick Wendell <[email protected]>
Date:   2014-04-14T22:51:54Z

    HOTFIX: Use file name and not paths for excludes

commit 0247b5c5467ca1b0d03ba929a78fa4d805582d84
Author: Sean Owen <[email protected]>
Date:   2014-04-15T02:50:00Z

    SPARK-1488. Resolve scalac feature warnings during build
    
    For your consideration: scalac currently notes a number of feature warnings 
during compilation:
    
    ```
    [warn] there were 65 feature warning(s); re-run with -feature for details
    ```
    
    Warnings are like:
    
    ```
    [warn] 
/Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261:
 implicit conversion method rddToPairRDDFunctions should be enabled
    [warn] by making the implicit value scala.language.implicitConversions 
visible.
    [warn] This can be achieved by adding the import clause 'import 
scala.language.implicitConversions'
    [warn] or by setting the compiler option -language:implicitConversions.
    [warn] See the Scala docs for value scala.language.implicitConversions for 
a discussion
    [warn] why the feature should be explicitly enabled.
    [warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: 
RDD[(K, V)]) =
    [warn]                ^
    ```
    
    scalac is suggesting that it's just best practice to explicitly enable 
certain language features by importing them where used.
    
    This PR simply adds the imports it suggests (and squashes one other Java 
warning along the way). This leaves just deprecation warnings in the build.
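
    For illustration, the pattern applied throughout is simply the following (a generic example, not a line from the Spark patch):
    
    ```
    // Importing the feature at the use site silences the -feature warning.
    import scala.language.implicitConversions
    
    object Implicits {
      implicit def intToString(i: Int): String = i.toString
    }
    ```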
    
    Author: Sean Owen <[email protected]>
    
    Closes #404 from srowen/SPARK-1488 and squashes the following commits:
    
    8598980 [Sean Owen] Quiet scalac warnings about language features by 
explicitly importing language features.
    39bc831 [Sean Owen] Enable -feature in scalac to emit language feature 
warnings

commit c99bcb7feaa761c5826f2e1d844d0502a3b79538
Author: Ahir Reddy <[email protected]>
Date:   2014-04-15T07:07:55Z

    SPARK-1374: PySpark API for SparkSQL
    
    An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD 
composed of dictionaries, with string keys and primitive values (boolean, 
float, int, long, string) can be converted into a SchemaRDD that supports sql 
queries.
    
    ```
    from pyspark.context import SQLContext
    sqlCtx = SQLContext(sc)
    rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, 
"field2": "row2"}, {"field1" : 3, "field2": "row3"}])
    srdd = sqlCtx.applySchema(rdd)
    sqlCtx.registerRDDAsTable(srdd, "table1")
    srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
    srdd2.collect()
    ```
    The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": 
"row2"}, {"f1" : 3, "f2": "row3"}]```
    
    Author: Ahir Reddy <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes #363 from ahirreddy/pysql and squashes the following commits:
    
    0294497 [Ahir Reddy] Updated log4j properties to suppress Hive Warns
    307d6e0 [Ahir Reddy] Style fix
    6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble 
Spark jar with Hive, we don't want to check the interfaces of all of our hive 
dependencies
    3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
    29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and 
caching PythonRDD
    f2312c7 [Ahir Reddy] Moved everything into sql.py
    a19afe4 [Ahir Reddy] Doc fixes
    6d658ba [Ahir Reddy] Remove the metastore directory created by the 
HiveContext tests in SparkSQL
    521ff6d [Ahir Reddy] Trying to get spark to build with hive
    ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
    ded03e7 [Ahir Reddy] Added doc test for HiveContext
    22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
    e4da06c [Ahir Reddy] Display message if hive is not built into spark
    227a0be [Michael Armbrust] Update API links. Fix Hive example.
    58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
    4285340 [Michael Armbrust] Fix building of Hive API Docs.
    38a92b0 [Michael Armbrust] Add note to future non-python developers about 
python docs.
    337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 
2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
    40491c9 [Ahir Reddy] PR Changes + Method Visibility
    1836944 [Michael Armbrust] Fix comments.
    e00980f [Michael Armbrust] First draft of python sql programming guide.
    b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit 
test
    f98a422 [Ahir Reddy] HiveContexts
    79621cf [Ahir Reddy] cleaning up cruft
    b406ba0 [Ahir Reddy] doctest formatting
    20936a5 [Ahir Reddy] Added tests and documentation
    e4d21b4 [Ahir Reddy] Added pyrolite dependency
    79f739d [Ahir Reddy] added more tests
    7515ba0 [Ahir Reddy] added more tests :)
    d26ec5e [Ahir Reddy] added test
    e9f5b8d [Ahir Reddy] adding tests
    906d180 [Ahir Reddy] added todo explaining cost of creating Row object in 
python
    251f99d [Ahir Reddy] for now only allow dictionaries as input
    09b9980 [Ahir Reddy] made jrdd explicitly lazy
    c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
    725c91e [Ahir Reddy] awesome row objects
    55d1c76 [Ahir Reddy] return row objects
    4fe1319 [Ahir Reddy] output dictionaries correctly
    be079de [Ahir Reddy] returning dictionaries works
    cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
    e948bd9 [Ahir Reddy] yippie
    4886052 [Ahir Reddy] even better
    c0fb1c6 [Ahir Reddy] more working
    043ca85 [Ahir Reddy] working
    5496f9f [Ahir Reddy] doesn't crash
    b8b904b [Ahir Reddy] Added schema rdd class
    67ba875 [Ahir Reddy] java to python, and python to java
    bcc0f23 [Ahir Reddy] Java to python
    ab6025d [Ahir Reddy] compiling

commit df360917990ad95dde3c8e016ec42507d1566355
Author: Sandeep <[email protected]>
Date:   2014-04-15T07:19:43Z

    SPARK-1426: Make MLlib work with NumPy versions older than 1.7
    
    Currently MLlib requires NumPy 1.7 because it uses the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) to extract data out of an array. This change replaces it with a fallback.
    
    Author: Sandeep <[email protected]>
    
    Closes #391 from techaddict/1426 and squashes the following commits:
    
    d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older 
than 1.7 Currently it requires NumPy 1.7 due to using the copyto method 
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for 
extracting data out of an array. Replace it with a fallback

commit 2580a3b1a06188fa97d9440d793c8835ef7384b0
Author: William Benton <[email protected]>
Date:   2014-04-15T17:38:42Z

    SPARK-1501: Ensure assertions in Graph.apply are asserted.
    
    The Graph.apply test in GraphSuite had some assertions in a closure in
    a graph transformation. As a consequence, these assertions never
    actually executed.  Furthermore, these closures had a reference to
    (non-serializable) test harness classes because they called assert(),
    which could be a problem if we proactively check closure serializability
    in the future.
    
    This commit simply changes the Graph.apply test to collect the graph
    triplets so it can assert about each triplet from a map method.
    
    Author: William Benton <[email protected]>
    
    Closes #415 from willb/graphsuite-nop-fix and squashes the following 
commits:
    
    0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.

commit 6843d637e72e5262d05cfa2b1935152743f2bd5a
Author: DB Tsai <[email protected]>
Date:   2014-04-15T18:12:47Z

    [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.
    
    This PR uses Breeze's L-BFGS implementation; the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!
    
    When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularized part of the cost function); with the current updater design, we can compute those two values in the following way.
    
    Let's review how the updater returns newWeights given the input parameters:
    
        w' = w - thisIterStepSize * (gradient + regGradient(w))
    
    Note that regGradient is a function of w. If we set gradient = 0 and thisIterStepSize = 1, then
    
        regGradient(w) = w - w'
    
    As a result, regVal can be computed by
    
        val regVal = updater.compute(
          weights,
          new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
    
    and regGradient can be obtained by
    
        val regGradient = weights.sub(
          updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
    
    The PR includes tests which compare the results with SGD with and without regularization.
    
    We did a comparison between LBFGS and SGD, and we often saw 10x fewer steps with LBFGS while the cost per step is the same (just computing the gradient).
    
    The following is the paper by Prof. Ng at Stanford comparing different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth referencing:
    http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
    
    Author: DB Tsai <[email protected]>
    
    Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits:
    
    984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also 
fixed indentation issue in GradientDescent optimizer.

commit 07d72fe6965aaf299d61bf6156d48bcfebc41b32
Author: Manish Amde <[email protected]>
Date:   2014-04-15T18:14:28Z

    Decision Tree documentation for MLlib programming guide
    
    Added documentation for users on the decision tree algorithms for classification and regression in the Spark 1.0 release.
    
    Apart from a general review, I need specific input on the following:
    * I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
    * I have not looked closely at pyspark, but I am wondering whether new mllib algorithms are automatically plugged in or whether we need to do some extra work to call mllib functions from pyspark. I will add to the pyspark examples based upon the advice I get.
    
    cc: @mengxr, @hirakendu, @etrain, @atalwalkar
    
    Author: Manish Amde <[email protected]>
    
    Closes #402 from manishamde/tree_doc and squashes the following commits:
    
    022485a [Manish Amde] more documentation
    865826e [Manish Amde] minor: grammar
    dbb0e5e [Manish Amde] minor improvements to text
    b9ef6c4 [Manish Amde] basic decision tree code examples
    6e297d7 [Manish Amde] added subsections
    f427e84 [Manish Amde] renaming sections
    9c0c4be [Manish Amde] split candidate
    6925275 [Manish Amde] impurity and information gain
    94fd2f9 [Manish Amde] more reorg
    b93125c [Manish Amde] more subsection reorg
    3ecb2ad [Manish Amde] minor text addition
    1537dd3 [Manish Amde] added placeholders and some doc
    d06511d [Manish Amde] basic skeleton

commit 5aaf9836f108d4ef9afe809353ad4d3aed560368
Author: Patrick Wendell <[email protected]>
Date:   2014-04-16T02:34:39Z

    SPARK-1455: Better isolation for unit tests.
    
    This is a simple first step towards avoiding running the Hive tests
    whenever possible.
    
    Author: Patrick Wendell <[email protected]>
    
    Closes #420 from pwendell/test-isolation and squashes the following commits:
    
    350c8af [Patrick Wendell] SPARK-1455: Better isolation for unit tests.

commit 8517911efb89aade61c8b8c54fee216dae9a4b4f
Author: Xiangrui Meng <[email protected]>
Date:   2014-04-16T02:37:32Z

    [FIX] update sbt-idea to version 1.6.0
    
    I saw `No "scala-library*.jar" in Scala compiler library` error in IDEA. It 
seems upgrading `sbt-idea` to 1.6.0 fixed the problem.
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #419 from mengxr/idea-plugin and squashes the following commits:
    
    fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0

commit 63ca581d9c84176549b1ea0a1d8d7c0cca982acc
Author: Matei Zaharia <[email protected]>
Date:   2014-04-16T03:33:24Z

    [WIP] SPARK-1430: Support sparse data in Python MLlib
    
    This PR adds a SparseVector class in PySpark and updates all the 
regression, classification and clustering algorithms and models to support 
sparse data, similar to MLlib. I chose to add this class because SciPy is quite 
difficult to install in many environments (more so than NumPy), but I plan to 
add support for SciPy sparse vectors later too, and make the methods work 
transparently on objects of either type.
    
    On the Scala side, we keep Python sparse vectors sparse and pass them to 
MLlib. We always return dense vectors from our models.
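
    For context, MLlib's sparse representation is a size plus parallel index/value arrays; a toy Scala sketch of that shape (not the PySpark class added here):
    
    ```
    // A minimal sparse vector: only the non-zero entries are stored.
    case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
      def toDense: Array[Double] = {
        val arr = new Array[Double](size)
        indices.zip(values).foreach { case (i, v) => arr(i) = v }
        arr
      }
    }
    
    // SparseVec(4, Array(1, 3), Array(2.0, 5.0)).toDense  // => Array(0.0, 2.0, 0.0, 5.0)
    ```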
    
    Some to-do items left:
    - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. 
We can easily add a function to convert these to our own SparseVector.
    - [x] MLlib currently uses a vector with one extra column on the left to 
represent what we call LabeledPoint in Scala. Do we really want this? It may 
get annoying once you deal with sparse data since you must add/subtract 1 to 
each feature index when training. We can remove this API in 1.0 and use tuples 
for labeling.
    - [x] Explain how to use these in the Python MLlib docs.
    
    CC @mengxr, @joshrosen
    
    Author: Matei Zaharia <[email protected]>
    
    Closes #341 from mateiz/py-ml-update and squashes the following commits:
    
    d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle 
review comments
    ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
    b9f97a3 [Matei Zaharia] Fix test
    1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
    88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
    37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib 
API
    da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to 
discuss sparse data
    c48e85a [Matei Zaharia] Added some tests for passing lists as input, and 
added mllib/tests.py to run-tests script.
    a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
    74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
    889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms 
and models
    ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
    a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
    0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading 
LabeledPoints
    eaee759 [Matei Zaharia] Update regression, classification and clustering 
models for sparse data
    2abbb44 [Matei Zaharia] Further work to get linear models working with 
sparse data
    154f45d [Matei Zaharia] Update docs, name some magic values
    881fef7 [Matei Zaharia] Added a sparse vector in Python and made 
Java-Python format more compact

commit 273c2fd08deb49e970ec471c857dcf0b2953f922
Author: Michael Armbrust <[email protected]>
Date:   2014-04-16T03:40:40Z

    [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs
    
    This makes it possible to create tables and insert into them using the DSL and SQL, for both the Scala and Java APIs.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #354 from marmbrus/insertIntoTable and squashes the following 
commits:
    
    6c6f227 [Michael Armbrust] Create random temporary files in python parquet 
unit tests.
    f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' 
into insertIntoTable
    765c506 [Michael Armbrust] Add to JavaAPI.
    77b512c [Michael Armbrust] typos.
    5c3ef95 [Michael Armbrust] use names for boolean args.
    882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up 
api annotations.
    d07d94b [Michael Armbrust] Add tests, support for creating parquet files 
and hive tables.
    fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as 
well.  Add createTableAs function.

commit 6a10d801626f1513b1b349b54ba0e2e6bf55c7e2
Author: Cheng Lian <[email protected]>
Date:   2014-04-16T15:52:14Z

    [SPARK-959] Updated SBT from 0.13.1 to 0.13.2
    
    JIRA issue: 
[SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959)
    
    SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removes the previous workaround.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #426 from liancheng/updateSbt and squashes the following commits:
    
    95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959

commit c0273d806ea9b83dd8585039f2a18c2cc795dad2
Author: Marcelo Vanzin <[email protected]>
Date:   2014-04-16T15:53:01Z

    Make "spark logo" link refer to "/".
    
    This is not an issue with the driver UI, but when you fire
    up the history server, there's currently no way to go back to
    the app listing page without editing the browser's location
    field (since the logo's link points to the root of the
    application's own UI - i.e. the "stages" tab).
    
    The change just points the logo link to "/", which is the app
    listing for the history server, and the stages tab for the
    driver's UI.
    
    Tested with both history server and live driver.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #408 from vanzin/web-ui-root and squashes the following commits:
    
    1b60cb6 [Marcelo Vanzin] Make "spark logo" link refer to "/".

----


