GitHub user debasish83 reopened a pull request:
https://github.com/apache/spark/pull/2705
[MLLIB] [WIP] SPARK-2426: Quadratic Minimization for MLlib ALS
ALS is a generic algorithm for matrix factorization that is equally
applicable in both feature space and similarity space. The current ALS supports L2
regularization and a positivity constraint. This PR introduces userConstraint and
productConstraint to ALS and lets the user select different constraints for the
user and product solves. The supported constraints are the following (a hedged
sketch of the corresponding projection/proximal operators follows the list):
1. SMOOTH : default ALS with L2 regularization
2. POSITIVE: ALS with positive factors
3. BOUNDS: ALS with factors bounded between a lower and an upper bound (default
between 0 and 1)
4. SPARSE: ALS with L1 regularization
5. EQUALITY: ALS with an equality constraint (default: the factors are positive
and sum to 1)
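As a rough illustration of what each constraint reduces to inside a proximal solver, here is a hedged Scala sketch. The names (ConstraintProx, proxL1, projectBox, projectSimplex) are illustrative only and are not the API of this PR.
```
// Hedged sketch, not the PR's code: the projection/proximal operators that the
// indicator functions (or penalties) of the constraints above correspond to.
object ConstraintProx {

  // SPARSE: soft-thresholding, the proximal operator of lambda * ||x||_1.
  // (The constrained form ||x||_1 <= c would instead use a projection onto the L1 ball.)
  def proxL1(x: Array[Double], lambda: Double): Array[Double] =
    x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

  // POSITIVE / BOUNDS: projection onto the box [lower, upper]
  // (POSITIVE is the special case lower = 0, upper = Double.PositiveInfinity).
  def projectBox(x: Array[Double], lower: Double, upper: Double): Array[Double] =
    x.map(v => math.min(math.max(v, lower), upper))

  // EQUALITY (default form): Euclidean projection onto the probability simplex
  // { x : x_i >= 0, sum_i x_i = 1 }, following Duchi et al. (ICML 2008).
  def projectSimplex(x: Array[Double]): Array[Double] = {
    val n = x.length
    val sorted = x.sorted(Ordering[Double].reverse)
    val cumSums = sorted.scanLeft(0.0)(_ + _).tail // cumSums(j) = sum of the top j+1 entries
    val rho = (0 until n).reverse
      .find(j => sorted(j) - (cumSums(j) - 1.0) / (j + 1) > 0)
      .getOrElse(0)
    val theta = (cumSums(rho) - 1.0) / (rho + 1)
    x.map(v => math.max(v - theta, 0.0))
  }
}
```
These operators play the role of the z-update in the splitting scheme described next.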
First, let's focus on the problem formulation. Both the implicit and the explicit
feedback ALS formulations can be written as a quadratic minimization problem,
with a quadratic objective of the form x^T H x + c^T x. Each constraint then
yields a problem of the form
minimize x^T H x + c^T x
s.t. ||x||_1 <= c (the SPARSE constraint, for example)
We rewrite the objective as f(x) = x^T H x + c^T x and the constraint as an
indicator function g(x).
Minimization of f(x) + g(x) can then be carried out using various forward-backward
splitting algorithms. We choose ADMM for the first version, based on our
experiments comparing against the ECOS interior-point solver and MOSEK; I will
document those comparisons.
Details of the algorithm are in the following reference:
http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
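As a quick reference, the scaled-form ADMM iterations from that paper, specialized to this quadratic f and an indicator g, look as follows (a sketch only; the factor of 2 in the x-update assumes the objective is written exactly as x^T H x + c^T x with symmetric H, which may differ from the code):
$$
\begin{aligned}
x^{k+1} &= \arg\min_x \; x^T H x + c^T x + \tfrac{\rho}{2}\,\lVert x - z^k + u^k\rVert_2^2
        \;\Longleftrightarrow\; (2H + \rho I)\,x^{k+1} = \rho\,(z^k - u^k) - c,\\
z^{k+1} &= \operatorname{prox}_{g/\rho}\!\bigl(x^{k+1} + u^k\bigr)
        \quad\text{(a projection or soft-thresholding, depending on the constraint)},\\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}.
\end{aligned}
$$
The x-update is a single linear solve with a fixed matrix, so its factorization can be reused across iterations; rho (and any over-relaxation parameter alpha) governs how quickly the iterates converge, which is what the issues below are about.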
Right now the default ADMM parameters alpha and rho are set to 1.0, but the
following issues show up in experiments on the MovieLens dataset:
1. ~3x more iterations compared to NNLS
2. For SPARSE we hit the maximum iteration count (400) around 10% of the time
3. For EQUALITY, rho is set to 50, based on a reference from Professor Boyd
on optimal control
We chose ADMM as the baseline solver, but this PR will explore the
following solver enhancements to decrease the iteration count:
1. Accelerated ADMM using Nesterov acceleration
2. FISTA-style forward-backward splitting
In terms of use-cases, the PR focuses on the following:
1. Sparse matrix factorization to improve recommendations
On MovieLens data the RMSE with SPARSE (1.04, versus the Mahout/Spark baseline
of 0.9) is currently worse, but we have not yet looked at MAP, prec@k, and
ndcg@k measures. We are using the PR from @coderxiang to look into the IR measures.
Example run:
MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank
20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065
--productConstraint SPARSE --lambdaProduct 0.1 --kryo
hdfs://localhost:8020/sandbox/movielens/
2. Topic modeling using LSA
References:
2007 Sparse coding:
papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
2011 Sparse Latent Semantic Analysis (LSA) (parts of it are implemented in
GraphLab):
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
2012 Sparse Coding + MR/MPI Microsoft:
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
We are implementing the 20 Newsgroups (20NG) flow to validate the sparse coding
result improvement over LDA-based topic modeling.
3. Topic modeling using PLSA
Reference:
Tutorial on Probabilistic Topic Modeling: Additive Regularization for
Stochastic Matrix Factorization
The EQUALITY formulation with a quadratic loss is an approximation to the
KL-divergence loss used in PLSA (a sketch of this approximation follows). We are
interested to see whether it improves the results further compared to sparse coding.
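As a hedged sketch of why the quadratic approximation is reasonable (a standard second-order argument, not taken from the PR or the tutorial): for distributions p = q + delta on the simplex, so that the deltas sum to zero,
$$
\mathrm{KL}(p \,\|\, q) \;=\; \sum_i p_i \log\frac{p_i}{q_i}
\;\approx\; \sum_i \delta_i + \frac{1}{2}\sum_i \frac{\delta_i^2}{q_i}
\;=\; \frac{1}{2}\sum_i \frac{(p_i - q_i)^2}{q_i},
$$
i.e. a weighted quadratic, which is the kind of loss the EQUALITY-constrained quadratic solve can fit directly.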
Next steps:
1. Improve the convergence rate of forward-backward splitting on quadratic
problems
2. Move the test-cases to QuadraticMinimizerSuite.scala
3. Generate results for each of the use-cases and add tests related to each
use-case
Related future PRs:
1. Scale the factorization rank and remove the need to construct the H matrix
2. Replace the quadratic loss x^T H x + c^T x with a general convex loss
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/debasish83/spark qp-als
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2705.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2705
----
commit 9c439d33160ef3b31173381735dfa8cfb7d552ba
Author: Xiangrui Meng <[email protected]>
Date: 2014-10-09T05:35:14Z
[SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade
Got warning msg:
~~~
[warn]
/Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50:
method norm in trait NumericOps is deprecated: Use norm(XXX) instead of
XXX.norm
[warn] var norm = vector.toBreeze.norm(p)
~~~
dbtsai
Author: Xiangrui Meng <[email protected]>
Closes #2718 from mengxr/SPARK-3856 and squashes the following commits:
4f38169 [Xiangrui Meng] use norm operator
commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37
Author: Anand Avati <[email protected]>
Date: 2014-10-09T06:45:17Z
[SPARK-2805] Upgrade to akka 2.3.4
Upgrade to akka 2.3.4
Author: Anand Avati <[email protected]>
Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following
commits:
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on
akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
commit 86b392942daf61fed2ff7490178b128107a0e856
Author: Xiangrui Meng <[email protected]>
Date: 2014-10-09T07:00:24Z
[SPARK-3844][UI] Truncate appName in WebUI if it is too long
Truncate appName in WebUI if it is too long.
Author: Xiangrui Meng <[email protected]>
Closes #2707 from mengxr/truncate-app-name and squashes the following
commits:
87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
commit 13cab5ba44e2f8d2d2204b3b0d39d7c23a819bdb
Author: nartz <[email protected]>
Date: 2014-10-09T07:02:11Z
add spark.driver.memory to config docs
It took me a minute to track this down, so I thought it could be useful to
have it in the docs.
I'm unsure whether 512mb is the default for spark.driver.memory. Also, there
could be a better 'description' value to differentiate it from
spark.executor.memory.
Author: nartz <[email protected]>
Author: Nathan Artz <[email protected]>
Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and
squashes the following commits:
a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
commit 14f222f7f76cc93633aae27a94c0e556e289ec56
Author: Qiping Li <[email protected]>
Date: 2014-10-09T08:36:58Z
[SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training
Currently, the implementation does one unnecessary aggregation step. The
aggregation step for level L (to choose splits) gives enough information to set
the predictions of any leaf nodes at level L+1. We can use that info and skip
the aggregation step for the last level of the tree (which only has leaf nodes).
### Implementation Details
Each node now has an `impurity` field, and `predict` is changed from type
`Double` to type `Predict` (this can be used to compute prediction probabilities
in the future). When computing the best split for each node, we also compute the
impurity and predict for its child nodes, and use them to construct the newly
allocated child nodes. So at level L we have already set impurity and predict for
the nodes at level L+1. If level L+1 is the last level, then we can avoid the
aggregation and, what's more, the calculation of parent impurity.
Top nodes of each tree need to be treated differently because we have to
compute impurity and predict for them first. In `binsToBestSplit`, if the current
node is a top node (level == 0), we calculate impurity and predict first; after
finding the best split, the top node's predict and impurity are set to the
calculated values. Non-top nodes' impurity and predict are already calculated and
don't need to be recalculated. I considered adding an initialization step to set
the top nodes' impurity and predict so that all nodes could be treated the same
way, but this would require a lot of code duplication (all the code doing the seq
operation (BinSeqOp) would need to be duplicated), so I chose the current approach.
CC mengxr manishamde jkbradley, please help me review this, thanks.
Author: Qiping Li <[email protected]>
Closes #2708 from chouqin/avoid-agg and squashes the following commits:
8e269ea [Qiping Li] adjust code and comments
eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
c41b1b6 [Qiping Li] fix pyspark unit test
7ad7a71 [Qiping Li] fix unit test
822c912 [Qiping Li] add comments and unit test
e41d715 [Qiping Li] fix bug in test suite
6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree
training
commit 1e0aa4deba65aa1241b9a30edb82665eae27242f
Author: GuoQiang Li <[email protected]>
Date: 2014-10-09T16:22:32Z
[Minor] use norm operator after breeze 0.10 upgrade
cc mengxr
Author: GuoQiang Li <[email protected]>
Closes #2730 from witgo/SPARK-3856 and squashes the following commits:
2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
commit 73bf3f2e0c03216aa29c25fea2d97205b5977903
Author: zsxwing <[email protected]>
Date: 2014-10-09T18:27:21Z
[SPARK-3741] Make ConnectionManager propagate errors properly and add more
logs to avoid Executors swallowing errors
This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated
properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow calling
success/failure more than once (illustrated below).
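A minimal illustration of the Promise behavior that motivates this (not taken from the PR itself):
```
import scala.concurrent.Promise

// Completing a Promise twice with success() throws, while trySuccess() simply
// returns false, which is why the PR switches to the try* variants.
object PromiseOnce {
  val p = Promise[Int]()
  p.success(1)                            // completes the promise
  val accepted: Boolean = p.trySuccess(2) // false: already completed, no exception
  // p.success(3)                         // would throw IllegalStateException: Promise already completed
}
```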
Author: zsxwing <[email protected]>
Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits:
1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors
properly and add more logs to avoid Executors swallowing errors
commit b77a02f41c60d869f48b65e72ed696c05b30bc48
Author: Vida Ha <[email protected]>
Date: 2014-10-09T20:13:31Z
[SPARK-3752][SQL]: Add tests for different UDF's
Author: Vida Ha <[email protected]>
Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits:
d7fdbbc [Vida Ha] Add tests for different UDF's
commit 752e90f15e0bb82d283f05eff08df874b48caed9
Author: Yash Datta <[email protected]>
Date: 2014-10-09T19:59:14Z
[SPARK-3711][SQL] Optimize where in clause filter queries
When all the filters are literals, the In case class is replaced by an InSet
class, which uses a HashSet instead of a Sequence, giving a significant
performance improvement (previously the Seq used a worst-case linear match via
the exists method, since the filter list was assumed to contain arbitrary
expressions); see the sketch below. The improvement should be most visible when a
small percentage of a large dataset matches the filter list.
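A minimal sketch of the idea only, not the actual Catalyst expression classes:
```
import scala.collection.immutable.HashSet

// When every element of an IN list is a literal, membership can be tested
// against a precomputed HashSet in (amortized) constant time instead of
// scanning a Seq for every evaluated row.
object InVsInSet {
  val literals: Seq[Any] = Seq(1, 2, 3, 5, 8, 13)

  // Before: worst-case linear scan per evaluated row.
  def inSeq(value: Any): Boolean = literals.exists(_ == value)

  // After: constant-time lookup against a set built once.
  val literalSet: HashSet[Any] = HashSet(literals: _*)
  def inSet(value: Any): Boolean = literalSet.contains(value)
}
```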
Author: Yash Datta <[email protected]>
Closes #2561 from saucam/branch-1.1 and squashes the following commits:
4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order
2. Fix optimization condition 3. Add tests for null in filter
list 4. Add test case that optimization is not triggered in case of
attributes in filter list
afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in
ExpressionEvaluationSuite 2. Add class OptimizedInSuite on the
lines of ConstantFoldingSuite, for the optimized In clause
0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by
constantFolding
bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move
optimization of In clause to Optimizer.scala by adding a rule. Add appropriate
comments
430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of
negative values as well
bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
commit 2c8851343a2e4d1d5b3a2b959eaa651a92982a72
Author: scwf <[email protected]>
Date: 2014-10-09T20:22:36Z
[SPARK-3806][SQL] Minor fix for CliSuite
This fixes two issues in CliSuite.
1. CliSuite throws IndexOutOfBoundsException:
Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
at
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at
org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
at
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
at
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at
scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
at
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
at
scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
at
scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)
Actually, the problem is caused by multiple threads.
2. Use ```line.startsWith``` instead of ```line.contains``` to assert the
expected answer. This fixes a tiny bug in CliSuite: for the test case "Simple
commands" there is an expected answer "5"; if we use ```contains```, output like
"14/10/06 11:```5```4:36 INFO CliDriver: Time taken: 1.078 seconds"
or "14/10/06 11:54:36 INFO StatsReportListener: 0% ```5```% 10%
25% 50% 75% 90% 95% 100%" will also make the assertion true (see the example below).
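A minimal illustration of the false positive (hypothetical values, not the test's code):
```
object StartsWithVsContains {
  val logLine  = "14/10/06 11:54:36 INFO CliDriver: Time taken: 1.078 seconds"
  val expected = "5"

  val looseMatch  = logLine.contains(expected)   // true: matches the 5 inside the timestamp
  val strictMatch = logLine.startsWith(expected) // false: only a line beginning with the answer matches
}
```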
Author: scwf <[email protected]>
Closes #2666 from scwf/clisuite and squashes the following commits:
11430db [scwf] fix-clisuite
commit e7edb723d22869f228b838fd242bf8e6fe73ee19
Author: cocoatomo <[email protected]>
Date: 2014-10-09T20:46:26Z
[SPARK-3868][PySpark] Hard to recognize which module is tested from
unit-tests.log
The ./python/run-tests script displays messages about which test it is currently
running on stdout, but does not write them to unit-tests.log.
This makes it harder to recognize which test programs were executed and which
test failed.
Author: cocoatomo <[email protected]>
Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and
squashes the following commits:
c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is
tested from unit-tests.log
commit ec4d40e48186af18e25517e0474020720645f583
Author: Mike Timper <[email protected]>
Date: 2014-10-09T21:02:27Z
[SPARK-3853][SQL] JSON Schema support for Timestamp fields
In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType
function and a toTimestamp function.
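A hedged sketch of what such a coercion could look like; the accepted input types here are an assumption, not necessarily what the PR implements:
```
// Illustrative only: coerce a parsed JSON value to java.sql.Timestamp.
def toTimestamp(value: Any): java.sql.Timestamp = value match {
  case l: Long   => new java.sql.Timestamp(l)            // epoch milliseconds
  case i: Int    => new java.sql.Timestamp(i.toLong)
  case s: String => java.sql.Timestamp.valueOf(s)         // "yyyy-mm-dd hh:mm:ss[.f...]"
}
```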
Author: Mike Timper <[email protected]>
Closes #2720 from mtimper/master and squashes the following commits:
9386ab8 [Mike Timper] Fix and tests for SPARK-3853
commit 1faa1135a3fc0acd89f934f01a4a2edefcb93d33
Author: Patrick Wendell <[email protected]>
Date: 2014-10-09T21:50:36Z
Revert "[SPARK-2805] Upgrade to akka 2.3.4"
This reverts commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37.
commit 1c7f0ab302de9f82b1bd6da852d133823bc67c66
Author: Yin Huai <[email protected]>
Date: 2014-10-09T21:57:27Z
[SPARK-3339][SQL] Support for skipping json lines that fail to parse
This PR aims to provide a way to skip/query corrupt JSON records. To do so,
we introduce an internal column to hold corrupt records (the default name is
`_corrupt_record`. This name can be changed by setting the value of
`spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we
put the corrupt record, in its unparsed form, into the internal column. Users can
skip/query this column through SQL.
* To query those corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```
Generally, it is not recommended to change the name of the internal column.
If the name has to be changed to avoid possible name conflicts, you can use
`sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)`
or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`.
Author: Yin Huai <[email protected]>
Closes #2680 from yhuai/corruptJsonRecord and squashes the following
commits:
4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into
corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to
"_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the
default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed
strings.
commit 0c0e09f567deb775ee378f5385a16884f68b332d
Author: Daoyuan Wang <[email protected]>
Date: 2014-10-09T21:59:03Z
[SPARK-3412][SQL]add missing row api
chenghao-intel assigned this to me, check PR #2284 for previous discussion
Author: Daoyuan Wang <[email protected]>
Closes #2529 from adrian-wang/rowapi and squashes the following commits:
c6594b2 [Daoyuan Wang] using boxed
7b7e6e3 [Daoyuan Wang] update pattern match
7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
1614493 [Daoyuan Wang] add missing row api
commit bc3b6cb06153d6b05f311dd78459768b6cf6a404
Author: Nathan Howell <[email protected]>
Date: 2014-10-09T22:03:01Z
[SPARK-3858][SQL] Pass the generator alias into logical plan node
The alias parameter is being ignored, which makes it more difficult to
specify a qualifier for Generator expressions.
Author: Nathan Howell <[email protected]>
Closes #2721 from NathanHowell/SPARK-3858 and squashes the following
commits:
8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into
logical plan node
commit ac302052870a650d56f2d3131c27755bb2960ad7
Author: ravipesala <[email protected]>
Date: 2014-10-09T22:14:58Z
[SPARK-3813][SQL] Support "case when" conditional functions in Spark SQL.
"case when" conditional function is already supported in Spark SQL but
there is no support in SqlParser. So added parser support to it.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <[email protected]>
Closes #2678 from ravipesala/SPARK-3813 and squashes the following commits:
70c75a7 [ravipesala] Fixed styles
713ea84 [ravipesala] Updated as per admin comments
709684f [ravipesala] Changed parser to support case when function.
commit 4e9b551a0b807f5a2cc6679165c8be4e88a3d077
Author: Josh Rosen <[email protected]>
Date: 2014-10-09T23:08:07Z
[SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython support
improvements:
This pull request addresses a few issues related to PySpark's IPython
support:
- Fix the remaining uses of the '-u' flag, which IPython doesn't support
(see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old
name is reserved in case we ever want to allow the worker Python options to be
customized (this variable was introduced in #2554 and hasn't landed in a
release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use
`ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid
breaking existing example programs).
There are more details in a block comment in `bin/pyspark`.
Author: Josh Rosen <[email protected]>
Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:
7b8eb86 [Josh Rosen] More changes to PySpark python executable
configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark
workers; IPython fixes:
commit 2837bf8548db7e9d43f6eefedf5a73feb22daedb
Author: Michael Armbrust <[email protected]>
Date: 2014-10-10T00:54:02Z
[SPARK-3798][SQL] Store the output of a generator in a val
This prevents it from changing during serialization, leading to corrupted
results.
Author: Michael Armbrust <[email protected]>
Closes #2656 from marmbrus/generateBug and squashes the following commits:
efa32eb [Michael Armbrust] Store the output of a generator in a val. This
prevents it from changing during serialization.
commit 363baacaded56047bcc63276d729ab911e0336cf
Author: Sean Owen <[email protected]>
Date: 2014-10-10T01:21:59Z
SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively,
Utils.createTempDir
I noticed a few issues with how temp directories are created and deleted:
*Minor*
* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in
many tests to make a temp dir, but `Utils.createTempDir()` seems to be the
standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into
`Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed
here_
*Bit Less Minor*
* `Utils.deleteRecursively()` fails immediately if any `IOException`
occurs, instead of trying to delete the remaining files and subdirectories.
I've observed this leave temp dirs around. I suggest changing it to continue in
the face of an exception and to throw, at the end, one of the possibly several
exceptions that occurred.
* `Utils.createTempDir()` adds a JVM shutdown hook every time the
method is called, even if the new subdir is under an already-registered parent
dir, since that check only happens inside the hook. However, `Utils` already
manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`; a
single hook can be registered to delete all of these on exit. This is how Tachyon
temp paths are cleaned up in `TachyonBlockManager`. A sketch of this pattern
appears right after this list.
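A hedged sketch of the single-shutdown-hook pattern described in the last bullet; the names used here (TempDirCleanup, shutdownDeletePaths, registerTempDirForCleanup) are illustrative, not Spark's actual code:
```
import java.io.File
import scala.collection.mutable

object TempDirCleanup {
  private val shutdownDeletePaths = mutable.HashSet[File]()

  // One hook, registered once, deletes everything accumulated in the set on JVM exit.
  Runtime.getRuntime.addShutdownHook(new Thread("delete temp dirs") {
    override def run(): Unit = shutdownDeletePaths.synchronized {
      shutdownDeletePaths.foreach(deleteRecursively)
    }
  })

  def registerTempDirForCleanup(dir: File): Unit =
    shutdownDeletePaths.synchronized { shutdownDeletePaths += dir }

  // Keep deleting siblings even when one deletion fails; rethrow one failure at the end.
  def deleteRecursively(file: File): Unit = {
    var firstError: Option[Throwable] = None
    if (file.isDirectory) {
      Option(file.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
        try deleteRecursively(child)
        catch { case t: Throwable => if (firstError.isEmpty) firstError = Some(t) }
      }
    }
    if (!file.delete() && file.exists() && firstError.isEmpty) {
      firstError = Some(new java.io.IOException(s"Failed to delete: $file"))
    }
    firstError.foreach(e => throw e)
  }
}
```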
I noticed a few other things that might be changed but wanted to ask first:
* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for
deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should
this logic not live together, rather than in `Utils`? It's more specific to
Tachyon, and it looks slightly odd to import it in such a generic place.
Author: Sean Owen <[email protected]>
Closes #2670 from srowen/SPARK-3811 and squashes the following commits:
071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths
even when an exception occurs; use one shutdown hook instead of one per method
call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of
Files.createTempDir
commit edf02da389f75df5a42465d41f035d6b65599848
Author: Cheng Lian <[email protected]>
Date: 2014-10-10T01:25:06Z
[SPARK-3654][SQL] Unifies SQL and HiveQL parsers
This PR is a follow up of #2590, and tries to introduce a top level SQL
parser entry point for all SQL dialects supported by Spark SQL.
A top level parser `SparkSQLParser` is introduced to handle the syntaxes
that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and
`SET`, etc.). For any syntax this parser doesn't recognize directly, it
falls back to a specified function that tries to parse arbitrary input into a
`LogicalPlan`; this function is typically another parser combinator like
`SqlParser` (a sketch of this fallback pattern is below). DDL syntaxes introduced
in #2475 can be moved here.
The `ExtendedHiveQlParser` now only handles Hive-specific extensions.
Also took the chance to refactor/reformat `SqlParser` for better
readability.
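A hedged, highly simplified sketch of the fallback pattern being described; the types and regex-based dispatch here are illustrative only (the real SparkSQLParser is a parser combinator):
```
// Illustrative only: handle the dialect-independent commands directly and
// delegate everything else to a dialect-specific parser (e.g. SqlParser or HiveQl).
sealed trait LogicalPlan
case class CacheTableCommand(table: String) extends LogicalPlan
case class SetCommand(key: String, value: String) extends LogicalPlan

class TopLevelSqlParser(fallback: String => LogicalPlan) {
  private val CachePattern = """(?i)CACHE\s+TABLE\s+(\w+)""".r
  private val SetPattern   = """(?i)SET\s+(\S+)\s*=\s*(\S+)""".r

  def parse(sql: String): LogicalPlan = sql.trim match {
    case CachePattern(table)    => CacheTableCommand(table)
    case SetPattern(key, value) => SetCommand(key, value)
    case other                  => fallback(other) // unrecognized: hand off to the dialect parser
  }
}
```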
Author: Cheng Lian <[email protected]>
Closes #2698 from liancheng/gen-sql-parser and squashes the following
commits:
ceada76 [Cheng Lian] Minor styling fixes
9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in
the parser
bb2ab12 [Cheng Lian] SET property value can be empty string
ce8860b [Cheng Lian] Passes test suites
e86968e [Cheng Lian] Removes debugging code
8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking
doesn't like it)
d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
commit 421382d0e728940caa3e61bc11237c61f256378a
Author: Cheng Lian <[email protected]>
Date: 2014-10-10T01:26:43Z
[SPARK-3824][SQL] Sets in-memory table default storage level to
MEMORY_AND_DISK
Using `MEMORY_AND_DISK` as the default storage level for in-memory table
caching. Due to the in-memory columnar representation, recomputing the partitions
of an in-memory cached table can be very expensive.
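A hedged illustration using the generic RDD API rather than the PR's internals: with MEMORY_AND_DISK, partitions evicted from memory spill to local disk instead of being recomputed, which is what makes it the safer default when recomputation is expensive.
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-demo").setMaster("local[*]"))

    // Simulate data that is expensive to recompute.
    val expensive = sc.parallelize(1 to 1000000).map(i => (i, i.toString * 10))

    // MEMORY_ONLY: partitions evicted under memory pressure are recomputed.
    // MEMORY_AND_DISK: evicted partitions spill to local disk instead.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)
    expensive.count()

    sc.stop()
  }
}
```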
Author: Cheng Lian <[email protected]>
Closes #2686 from liancheng/spark-3824 and squashes the following commits:
35d2ed0 [Cheng Lian] Removes extra space
1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
ba565f0 [Cheng Lian] Maks CachedBatch serializable
07f0204 [Cheng Lian] Sets in-memory table default storage level to
MEMORY_AND_DISK
commit 6f98902a3d7749e543bc493a8c62b1e3a7b924cc
Author: ravipesala <[email protected]>
Date: 2014-10-10T01:41:36Z
[SPARK-3834][SQL] Backticks not correctly handled in subquery aliases
Queries like SELECT a.key FROM (SELECT key FROM src) \`a\` do not
work because backticks in subquery aliases are not handled properly. This PR
fixes that.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <[email protected]>
Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:
0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
commit 411cf29fff011561f0093bb6101af87842828369
Author: Anand Avati <[email protected]>
Date: 2014-10-10T07:46:56Z
[SPARK-2805] Upgrade Akka to 2.3.4
This is a second rev of the Akka upgrade (earlier merged, but reverted). I
made a slight modification: I also upgraded Hive to deal with a compatibility
issue related to the protocol buffers library.
Author: Anand Avati <[email protected]>
Author: Patrick Wendell <[email protected]>
Closes #2752 from pwendell/akka-upgrade and squashes the following commits:
4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on
akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
commit 90f73fcc47c7bf881f808653d46a9936f37c3c31
Author: Aaron Davidson <[email protected]>
Date: 2014-10-10T08:44:36Z
[SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in
ConnectionManager
In general, individual shuffle blocks are frequently small, so mmapping
them often creates a lot of waste. It may not be bad to mmap the larger ones,
but it is pretty inconvenient to get configuration into ManagedBuffer, and
besides it is unlikely to help all that much.
Author: Aaron Davidson <[email protected]>
Closes #2742 from aarondav/mmap and squashes the following commits:
a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not
mmapping files in ConnectionManager
commit 72f36ee571ad27c7c7c70bb9aecc7e6ef51dfd44
Author: Davies Liu <[email protected]>
Date: 2014-10-10T21:14:05Z
[SPARK-3886] [PySpark] use AutoBatchedSerializer by default
Use AutoBatchedSerializer by default. It chooses the proper batch size based
on the size of the serialized objects, keeping the serialized batch size within
[64k, 640k].
In the JVM, the serializer also tracks the objects in a batch to detect
duplicated objects, so a larger batch may cause an OOM in the JVM.
Author: Davies Liu <[email protected]>
Closes #2740 from davies/batchsize and squashes the following commits:
52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
commit 1d72a30874a88bdbab75217f001cf2af409016e7
Author: Patrick Wendell <[email protected]>
Date: 2014-10-10T23:49:19Z
HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
We had to upgrade our Hive 0.12 version as well to deal with a protobuf
conflict (both hive and akka have been using a shaded protobuf version).
This is testing a correctly patched version of Hive 0.12.
Author: Patrick Wendell <[email protected]>
Closes #2756 from pwendell/hotfix and squashes the following commits:
cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
commit 0e8203f4fb721158fb27897680da476174d24c4b
Author: Prashant Sharma <[email protected]>
Date: 2014-10-11T01:39:55Z
[SPARK-2924] Required by Scala 2.11: only one function/constructor amongst
overridden alternatives can have default argument(s).
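A hedged, minimal illustration of the restriction being referenced (shown here for same-scope overloads; the exact cases Scala 2.11 rejects in Spark's sources may differ):
```
// Among overloaded alternatives, only one may declare default arguments;
// uncommenting the second definition makes the compiler reject the object with
// "multiple overloaded alternatives of method fit define default arguments".
object Estimator {
  def fit(data: Array[Double], iterations: Int = 10): Double = data.sum / iterations
  // def fit(data: Array[Int], iterations: Int = 10): Double = data.sum.toDouble / iterations
  def fit(data: Array[Int], iterations: Int): Double = data.sum.toDouble / iterations
}
```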
Author: Prashant Sharma <[email protected]>
Closes #2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes
the following commits:
d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one
function/ctor amongst overriden alternatives, can have default argument.
commit 81015a2ba49583d730ce65b2262f50f1f2451a79
Author: cocoatomo <[email protected]>
Date: 2014-10-11T18:26:17Z
[SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6
and unittest2 is not installed
The ./python/run-tests script searches for a Python 2.6 executable on PATH and
uses it if available.
When using Python 2.6, it tries to import the unittest2 module, which is not
part of the Python 2.6 standard library, so it fails with an ImportError.
Author: cocoatomo <[email protected]>
Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes
the following commits:
f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with
Python 2.6 and unittest2 is not installed
commit 7a3f589ef86200f99624fea8322e5af0cad774a7
Author: cocoatomo <[email protected]>
Date: 2014-10-11T18:51:59Z
[SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and
building warnings
The Sphinx documents contain corrupted reST formatting and produce some warnings.
The purpose of this issue is the same as
https://issues.apache.org/jira/browse/SPARK-3773.
commit: 0e8203f4fb721158fb27897680da476174d24c4b
output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring
of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends
without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring
of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends
without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of
pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected
unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry
u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.
Build finished. The HTML pages are in _build/html.
```
Author: cocoatomo <[email protected]>
Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes
the following commits:
2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx
documents and building warnings
----