[GitHub] spark pull request: Branch 1.2

codeAshu Thu, 20 Nov 2014 06:42:58 -0800

GitHub user codeAshu opened a pull request:

    https://github.com/apache/spark/pull/3385


    Branch 1.2

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3385.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3385
    
----
commit a68321400c1068449698d03cebd0fbf648627133
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-11-03T20:24:24Z

    [SPARK-4148][PySpark] fix seed distribution and add some tests for 
rdd.sample
    
    The current way of seed distribution makes the random sequences from 
partition i and i+1 offset by 1.
    
    ~~~
    In [14]: import random
    
    In [15]: r1 = random.Random(10)
    
    In [16]: r1.randint(0, 1)
    Out[16]: 1
    
    In [17]: r1.random()
    Out[17]: 0.4288890546751146
    
    In [18]: r1.random()
    Out[18]: 0.5780913011344704
    
    In [19]: r2 = random.Random(10)
    
    In [20]: r2.randint(0, 1)
    Out[20]: 1
    
    In [21]: r2.randint(0, 1)
    Out[21]: 0
    
    In [22]: r2.random()
    Out[22]: 0.5780913011344704
    ~~~
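
A minimal sketch of the problem in plain Python, not taken from the patch. `offset_stream` mimics a scheme where every partition shares one seed and skips one draw per preceding partition; `mixed_seed_stream` is one possible remedy (deriving a distinct seed per partition with XOR) and is an assumption here, as are both function names.

```python
import random


def offset_stream(seed, partition, n):
    """Shared seed; partition i discards i draws before sampling."""
    rng = random.Random(seed)
    for _ in range(partition):
        rng.random()  # one discarded draw per preceding partition
    return [rng.random() for _ in range(n)]


def mixed_seed_stream(seed, partition, n):
    """Hypothetical alternative: a different seed for every partition."""
    rng = random.Random(seed ^ partition)
    return [rng.random() for _ in range(n)]


a = offset_stream(10, 0, 5)
b = offset_stream(10, 1, 5)
# Partition 1 replays partition 0's sequence, shifted by one position.
offset_overlaps = a[1:] == b[:-1]

c = mixed_seed_stream(10, 0, 5)
d = mixed_seed_stream(10, 1, 5)
mixed_overlaps = c[1:] == d[:-1]
```

With the shared seed, neighbouring partitions make almost identical sampling decisions; with per-partition seeds the two sequences are unrelated.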
    
    Note: The new tests are not for this bug fix.
    
    Author: Xiangrui Meng <m...@databricks.com>
    
    Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
    
    869ae4b [Xiangrui Meng] move tests tests.py
    c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for 
rdd.sample
    
    (cherry picked from commit 3cca1962207745814b9d83e791713c91b659c36c)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit fc782896b5d51161feee950107df2acf17e12422
Author: fi <code...@gmail.com>
Date:   2014-11-03T20:56:56Z

    [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1
    
    The `hive-0.13.1` Maven profile referenced `hive.version=0.13.1` instead of
    the Spark-compatible `hive.version=0.13.1a`.
    e.g. mvn -Phive -Phive-0.13.1
    
    Note: `hive.version=0.13.1a` is the default property value. However, when 
explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be 
selected.
    References:  PR #2685, which resolved a package incompatibility issue with 
Hive-0.13.1 by introducing a special version Hive-0.13.1a
    
    Author: fi <code...@gmail.com>
    
    Closes #3072 from coderfi/master and squashes the following commits:
    
    7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing 
`hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` 
Note: `hive.version=0.13.1a` is the default version. However, when explicitly 
specifying the `hive-0.13.1` maven profile, the wrong one would be selected. 
e.g. mvn -Phive -Phive=0.13.1 See PR #2685
    
    (cherry picked from commit df607da025488d6c924d3d70eddb67f5523080d3)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 292da4ef25d6cce23bfde7b9ab663a574dfd2b00
Author: ravipesala <ravindra.pes...@huawei.com>
Date:   2014-11-03T21:07:41Z

    [SPARK-4207][SQL] Query which has syntax like 'not like' is not working in 
Spark SQL
    
    Queries that use 'not like' do not work in Spark SQL, for example:
    
    sql("SELECT * FROM records where value not like 'val%'")
    
    The same query works in Spark HiveQL.
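
For reference, the expected semantics of `NOT LIKE` can be checked with any SQL engine. The sketch below uses Python's built-in `sqlite3` module rather than Spark SQL, and the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (key INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "val_1"), (2, "other"), (3, "val_3"), (4, "misc")],
)

# Keep only the rows whose value does NOT start with 'val'.
rows = conn.execute(
    "SELECT key, value FROM records WHERE value NOT LIKE 'val%' ORDER BY key"
).fetchall()
conn.close()
```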
    
    Author: ravipesala <ravindra.pes...@huawei.com>
    
    Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits:
    
    35c11e7 [ravipesala] Supported 'not like' syntax in sql
    
    (cherry picked from commit 2b6e1ce6ee7b1ba8160bcbee97f5bbff5c46ca09)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit cc5dc4247979dc001302f7af978801b789acdbfa
Author: Davies Liu <davies....@gmail.com>
Date:   2014-11-03T21:17:09Z

    [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling
    
    This patch tries to infer the schema of an RDD whose first row contains
    empty values (None, [], {}). It looks at the first 100 rows and merges their
    types into one schema, also merging the fields of StructTypes. If the schema
    still contains a NullType, it shows a warning telling the user to try
    sampling.
    
    If a sampling ratio is given, the schema is inferred from all the rows after
    sampling.
    
    Also, add samplingRatio for jsonFile() and jsonRDD()
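
The merging idea can be sketched in plain Python. This is not the PySpark implementation: type names are plain strings, structs are dicts, arrays are `("array", element_type)` tuples, and all function names are hypothetical.

```python
NULL = "null"  # stand-in for NullType


def infer_type(value):
    """Infer a type description for one value; empty values are NULL."""
    if value is None or value == [] or value == {}:
        return NULL
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        element = NULL
        for item in value:
            element = merge(element, infer_type(item))
        return ("array", element)
    return type(value).__name__


def merge(a, b):
    """Merge two type descriptions; NULL gives way to any concrete type."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        keys = sorted(set(a) | set(b))
        return {k: merge(a.get(k, NULL), b.get(k, NULL)) for k in keys}
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge(a[1], b[1]))
    if a != b:
        raise TypeError("cannot merge %r with %r" % (a, b))
    return a


def infer_schema(rows, limit=100):
    """Merge the types of the first `limit` rows into one schema."""
    schema = NULL
    for row in rows[:limit]:
        schema = merge(schema, infer_type(row))
    return schema


def contains_null(schema):
    """True if some field is still unknown, i.e. sampling may be needed."""
    if schema == NULL:
        return True
    if isinstance(schema, dict):
        return any(contains_null(v) for v in schema.values())
    if isinstance(schema, tuple):
        return contains_null(schema[1])
    return False


rows = [{"a": None, "b": []}, {"a": 1, "b": [1.5]}, {"a": 2, "c": "x"}]
schema = infer_schema(rows)
```

Here the first row alone would leave both fields unknown; later rows fill them in, and `contains_null` signals when a warning would still be appropriate.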
    
    Author: Davies Liu <davies....@gmail.com>
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #2716 from davies/infer and squashes the following commits:
    
    e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
infer
    34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
infer
    567dc60 [Davies Liu] update docs
    9767b27 [Davies Liu] Merge branch 'master' into infer
    e48d7fb [Davies Liu] fix tests
    29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
    ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
infer
    540d1d5 [Davies Liu] merge fields for StructType
    f93fd84 [Davies Liu] add more tests
    3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by 
sampling the RDD
    
    (cherry picked from commit 24544fbce05665ab4999a1fe5aac434d29cd912c)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 572300ba8a5f24b52f19d7033a456248da20bfed
Author: Cheng Lian <l...@databricks.com>
Date:   2014-11-03T21:20:33Z

    [SPARK-4202][SQL] Simple DSL support for Scala UDF
    
    This feature is based on an offline discussion with mengxr and will hopefully
    be useful for the new MLlib pipeline API.
    
    For the following test snippet
    
    ```scala
    case class KeyValue(key: Int, value: String)
    val testData = sc.parallelize(1 to 10)
      .map(i => KeyValue(i, i.toString))
      .toSchemaRDD
    // A function value, so that it can also be passed to registerFunction
    val foo = (a: Int, b: String) => a.toString + b
    ```
    
    the newly introduced DSL enables the following syntax
    
    ```scala
    import org.apache.spark.sql.catalyst.dsl._
    testData.select(Star(None), foo.call('key, 'value) as 'result)
    ```
    
    which is equivalent to
    
    ```scala
    testData.registerTempTable("testData")
    sqlContext.registerFunction("foo", foo)
    sql("SELECT *, foo(key, value) AS result FROM testData")
    ```
    
    Author: Cheng Lian <l...@databricks.com>
    
    Closes #3067 from liancheng/udf-dsl and squashes the following commits:
    
    f132818 [Cheng Lian] Adds DSL support for Scala UDF
    
    (cherry picked from commit c238fb423d1011bd1b1e6201d769b72e52664fc6)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 6104754f711da9eb0c09daf377bcd750d2d23f8a
Author: Cheng Hao <hao.ch...@intel.com>
Date:   2014-11-03T21:59:43Z

    [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed
    
    CREATE TABLE t1 (a String);
    CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
    CREATE TABLE if not exists t1 AS SELECT key FROM src; -- expected to do nothing
    
    Currently the last statement overwrites t1, which is incorrect.
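
The intended behaviour matches what other SQL engines do. As an illustration only, the sketch below runs the same statements against SQLite through Python's built-in `sqlite3` module (not Spark SQL or Hive); the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (key TEXT)")
conn.executemany("INSERT INTO src VALUES (?)", [("k1",), ("k2",)])

conn.execute("CREATE TABLE t1 (a TEXT)")
conn.execute("INSERT INTO t1 VALUES ('original')")

# A plain CTAS on an existing table must fail.
try:
    conn.execute("CREATE TABLE t1 AS SELECT key FROM src")
    plain_ctas_failed = False
except sqlite3.OperationalError:
    plain_ctas_failed = True

# With IF NOT EXISTS the statement is a no-op and t1 keeps its data.
conn.execute("CREATE TABLE IF NOT EXISTS t1 AS SELECT key FROM src")
contents = conn.execute("SELECT a FROM t1").fetchall()
conn.close()
```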
    
    Author: Cheng Hao <hao.ch...@intel.com>
    
    Closes #3013 from chenghao-intel/ctas_unittest and squashes the following 
commits:
    
    194113e [Cheng Hao] fix bug in CTAS when table already existed
    
    (cherry picked from commit e83f13e8d37ca33f4e183e977d077221b90c6025)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 51985f78ca5f728f8b9233b703110f541d27b274
Author: Michael Armbrust <mich...@databricks.com>
Date:   2014-11-03T22:08:27Z

    [SQL] More aggressive defaults
    
     - Turns on compression for in-memory cached data by default
     - Changes the default parquet compression format back to gzip (we have 
seen more OOMs with production workloads due to the way Snappy allocates memory)
     - Ups the batch size to 10,000 rows
     - Increases the broadcast threshold to 10mb.
     - Uses our parquet implementation instead of the hive one by default.
     - Cache parquet metadata by default.
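
For orientation, the bullets above can be written out as explicit settings. The mapping of each bullet to a Spark SQL configuration key is an assumption made here, not something stated in the commit message, and `as_conf_lines` is a hypothetical helper.

```python
# Assumed mapping from the bullets above to Spark SQL configuration keys.
SQL_DEFAULTS = {
    "spark.sql.inMemoryColumnarStorage.compressed": "true",
    "spark.sql.parquet.compression.codec": "gzip",
    "spark.sql.inMemoryColumnarStorage.batchSize": "10000",
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),  # 10 MB in bytes
    "spark.sql.hive.convertMetastoreParquet": "true",
    "spark.sql.parquet.cacheMetadata": "true",
}


def as_conf_lines(settings):
    """Render settings as 'key value' lines, one per setting."""
    return ["%s %s" % (key, value) for key, value in sorted(settings.items())]


conf_lines = as_conf_lines(SQL_DEFAULTS)
```

Setting any of these keys back to its previous value would be the way to opt out of a new default.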
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #3064 from marmbrus/fasterDefaults and squashes the following 
commits:
    
    97ee9f8 [Michael Armbrust] parquet codec docs
    e641694 [Michael Armbrust] Remote also
    a12866a [Michael Armbrust] Cache metadata.
    2d73acc [Michael Armbrust] Update docs defaults.
    d63d2d5 [Michael Armbrust] document parquet option
    da373f9 [Michael Armbrust] More aggressive defaults
    
    (cherry picked from commit 25bef7e6951301e93004567fc0cef96bf8d1a224)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit fa86d862f98cfea3d9afff6e61b3141c9b08f949
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-11-03T23:19:01Z

    SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation
    
    Author: Sandy Ryza <sa...@cloudera.com>
    
    Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits:
    
    8d2e70e [Sandy Ryza] Kostas's review feedback
    e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in 
RecordReader instantiation
    
    (cherry picked from commit 28128150e7e0c2b7d1c483e67214bdaef59f7d75)
    Signed-off-by: Patrick Wendell <pwend...@gmail.com>

commit 52db2b9429e00d8ed398a2432ad6a26cd1e5920c
Author: Michael Armbrust <mich...@databricks.com>
Date:   2014-11-04T02:04:51Z

    [SQL] Convert arguments to Scala UDFs
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits:
    
    34b5f27 [Michael Armbrust] style
    504adef [Michael Armbrust] Convert arguments to Scala UDFs
    
    (cherry picked from commit 15b58a2234ab7ba30c9c0cbb536177a3c725e350)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 0826eed9c84a73544e3d8289834c8b5ebac47e03
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-11-04T02:50:37Z

    [FIX][MLLIB] fix seed in BaggedPointSuite
    
    Saw Jenkins test failures due to random seeds.
    
    jkbradley manishamde
    
    Author: Xiangrui Meng <m...@databricks.com>
    
    Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following 
commits:
    
    f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite
    
    (cherry picked from commit c5912ecc7b392a13089ae735c07c2d7256de36c6)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 42d02db86cd973cf31ceeede0c5a723238bbe746
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-11-04T03:29:11Z

    [SPARK-4192][SQL] Internal API for Python UDT
    
    Following #2919, this PR adds Python UDT (for internal use only) with tests 
under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we 
need to convert user-type instances into SQL recognizable data. In the current 
implementation, a Python UDT must be paired with a Scala UDT for serialization 
on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and 
Python.
    
    marmbrus jkbradley davies
    
    Author: Xiangrui Meng <m...@databricks.com>
    
    Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
    
    acff637 [Xiangrui Meng] merge master
    dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as 
well
    2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
    7c4a6a9 [Xiangrui Meng] address comments
    75223db [Xiangrui Meng] minor update
    f740379 [Xiangrui Meng] remove UDT from default imports
    e98d9d0 [Xiangrui Meng] fix py style
    4e84fce [Xiangrui Meng] remove local hive tests and add more tests
    39f19e0 [Xiangrui Meng] add tests
    b7f666d [Xiangrui Meng] add Python UDT
    
    (cherry picked from commit 04450d11548cfb25d4fb77d4a33e3a7cd4254183)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 8395e8fbdf23bef286ec68a4bbadcc448b504c2c
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-11-04T06:29:48Z

    [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD
    
    Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and 
Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and 
then select columns or save to a Parquet file. Examples in Scala/Python are 
attached. The Scala code was copied from jkbradley.
    
    ~~This PR contains the changes from #3068 . I will rebase after #3068 is 
merged.~~
    
    marmbrus jkbradley
    
    Author: Xiangrui Meng <m...@databricks.com>
    
    Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
    
    3a0b6e5 [Xiangrui Meng] organize imports
    236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
    
    (cherry picked from commit 1a9c6cddadebdc53d083ac3e0da276ce979b5d1f)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 786e75b33f0bc1445bfc289fe4b62407cb79026e
Author: Davies Liu <dav...@databricks.com>
Date:   2014-11-04T07:56:14Z

    [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by 
default.
    
    This PR simplifies the serializer: it always uses a batched serializer
    (AutoBatchedSerializer by default), even when the batch size is 1.
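
The idea of automatic batching can be sketched with `pickle`. This is a simplified illustration, not PySpark's serializer: the function names, the 64 KB target, and the doubling/halving rule are assumptions.

```python
import itertools
import pickle


def auto_batched_dumps(items, best_size=1 << 16):
    """Yield pickled batches, adapting the batch size to the payload size."""
    iterator = iter(items)
    batch = 1
    while True:
        chunk = list(itertools.islice(iterator, batch))
        if not chunk:
            break
        data = pickle.dumps(chunk)
        yield data
        if len(data) < best_size:
            batch *= 2  # payload is small: pack more objects next time
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2  # payload is far too large: back off


def load_batches(blobs):
    """Inverse of auto_batched_dumps: yield the original objects in order."""
    for blob in blobs:
        for item in pickle.loads(blob):
            yield item


items = list(range(1000))
blobs = list(auto_batched_dumps(items))
restored = list(load_batches(blobs))
```

Because every payload is a list, a batch size of 1 needs no special case on the reading side.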
    
    Author: Davies Liu <dav...@databricks.com>
    
    This patch had conflicts when merged, resolved by
    Committer: Josh Rosen <joshro...@databricks.com>
    
    Closes #2920 from davies/fix_autobatch and squashes the following commits:
    
    e544ef9 [Davies Liu] revert unrelated change
    6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
fix_autobatch
    1d557fc [Davies Liu] fix tests
    8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
fix_autobatch
    76abdce [Davies Liu] clean up
    53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
fix_autobatch
    d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
fix_autobatch
    2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
fix_autobatch
    b4292ce [Davies Liu] fix bug in master
    d79744c [Davies Liu] recover hive tests
    be37ece [Davies Liu] refactor
    eb3938d [Davies Liu] refactor serializer in scala
    8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by 
default.
    
    (cherry picked from commit e4f42631a68b473ce706429915f3f08042af2119)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 4b13bff939291caa1fb9b9a180db66b1d006153c
Author: Dariusz Kobylarz <darek.kobyl...@gmail.com>
Date:   2014-11-04T17:53:43Z

    fixed MLlib Naive-Bayes java example bug
    
    The filter compared Double objects by reference, whereas it should compare
    their values.
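
The same class of bug exists in Python, where `is` compares object identity and `==` compares values. The sketch below is only an analogy to the Java fix, with made-up data; the result of the identity-based count depends on the interpreter and should not be relied upon.

```python
# Parsing from strings yields separate float objects for each element.
predictions = [float(s) for s in ["1.0", "0.0", "1.0"]]
labels = [float(s) for s in ["1.0", "1.0", "1.0"]]

# Buggy: counts pairs that are the very same object.
matches_by_identity = sum(1 for p, l in zip(predictions, labels) if p is l)

# Correct: counts pairs that hold the same value.
matches_by_value = sum(1 for p, l in zip(predictions, labels) if p == l)

accuracy = matches_by_value / float(len(labels))
```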
    
    Author: Dariusz Kobylarz <darek.kobyl...@gmail.com>
    
    Closes #3081 from dkobylarz/master and squashes the following commits:
    
    5d43a39 [Dariusz Kobylarz] naive bayes example update
    a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug
    
    (cherry picked from commit bcecd73fdd4d2ec209259cfd57d3ad1d63f028f2)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit b90451814b7ff7338881e60124d779e2fd89ac60
Author: Niklas Wilcke <1wil...@informatik.uni-hamburg.de>
Date:   2014-11-04T17:57:03Z

    [Spark-4060] [MLlib] exposing special rdd functions to the public
    
    Author: Niklas Wilcke <1wil...@informatik.uni-hamburg.de>
    
    Closes #2907 from numbnut/master and squashes the following commits:
    
    7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions 
to the public, #2907
    
    (cherry picked from commit f90ad5d426cb726079c490a9bb4b1100e2b4e602)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit e5c7869f20139832ad9e636eaeb5e77da7297456
Author: Michael Armbrust <mich...@databricks.com>
Date:   2014-11-05T02:14:28Z

    [SQL] Add String option for DSL AS
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #3097 from marmbrus/asString and squashes the following commits:
    
    6430520 [Michael Armbrust] Add String option for DSL AS
    
    (cherry picked from commit 515abb9afa2d6b58947af6bb079a493b49d315ca)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit f225b3cc18698b2ee8a94c8ffa0b6aca2fce7cf9
Author: Davies Liu <dav...@databricks.com>
Date:   2014-11-05T05:35:52Z

    [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
    
    ```
    pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
        :: Experimental ::
    
        If `observed` is Vector, conduct Pearson's chi-squared goodness
        of fit test of the observed data against the expected distribution,
        or against the uniform distribution (by default), with each category
        having an expected frequency of `1 / len(observed)`.
        (Note: `observed` cannot contain negative values)
    
        If `observed` is matrix, conduct Pearson's independence test on the
        input contingency matrix, which cannot contain negative entries or
        columns or rows that sum up to 0.
    
        If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
        test for every feature against the label across the input RDD.
        For each feature, the (feature, label) pairs are converted into a
        contingency matrix for which the chi-squared statistic is computed.
        All label and feature values must be categorical.
    
        :param observed: it could be a vector containing the observed categorical
                         counts/relative frequencies, or the contingency matrix
                         (containing either counts or relative frequencies),
                         or an RDD of LabeledPoint containing the labeled dataset
                         with categorical features. Real-valued features will be
                         treated as categorical for each distinct value.
        :param expected: Vector containing the expected categorical counts/relative
                         frequencies. `expected` is rescaled if the `expected` sum
                         differs from the `observed` sum.
        :return: ChiSquaredTest object containing the test statistic, degrees
                 of freedom, p-value, the method used, and the null hypothesis.
    ```
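
The goodness-of-fit case from the docstring above can be sketched in plain Python. This is not the MLlib implementation; `chi_sq_statistic` is a hypothetical helper that returns only the statistic and the degrees of freedom and leaves out the p-value.

```python
def chi_sq_statistic(observed, expected=None):
    """Pearson's chi-squared goodness-of-fit statistic and degrees of freedom."""
    if any(o < 0 for o in observed):
        raise ValueError("observed cannot contain negative values")
    n = len(observed)
    total = float(sum(observed))
    if expected is None:
        # Uniform distribution: every category expects total / n.
        expected = [total / n] * n
    else:
        # Rescale expected so that its sum matches the observed sum.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1


stat_uniform, dof = chi_sq_statistic([10, 20, 30])
stat_rescaled, _ = chi_sq_statistic([10, 20, 30], expected=[1, 1, 1])
```

For the counts `[10, 20, 30]` each category expects 20, so the statistic is (100 + 0 + 100) / 20 = 10 with 2 degrees of freedom; passing `expected=[1, 1, 1]` gives the same result after rescaling.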
    
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #3091 from davies/his and squashes the following commits:
    
    145d16c [Davies Liu] address comments
    0ab0764 [Davies Liu] fix float
    5097d54 [Davies Liu] add Hypothesis test Python API
    
    (cherry picked from commit c8abddc5164d8cf11cdede6ab3d5d1ea08028708)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 46654b0661257f432932c6efc09c4c0983521834
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-11-05T09:21:53Z

    [SPARK-4029][Streaming] Update streaming driver to reliably save and 
recover received block metadata on driver failures
    
    As part of the initiative to prevent data loss on driver failure, this
    JIRA tracks the subtask of modifying the streaming driver to reliably save
    received block metadata and recover it on driver restart.
    
    This was solved by introducing a `ReceivedBlockTracker` that takes all the
    responsibility for managing the metadata of received blocks (i.e.
    `ReceivedBlockInfo`) and any actions on them (e.g. allocating blocks to
    batches). All actions on block info get written out to a write ahead log
    (using `WriteAheadLogManager`). On recovery, all the actions are replayed to
    recreate the pre-failure state of the `ReceivedBlockTracker`, which includes
    the batch-to-block allocations and the unallocated blocks.
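
The log-then-replay pattern can be sketched in a few lines of Python. This is a toy model, not the Spark Streaming code: the class, its methods, and the use of an in-memory list of JSON strings as the write ahead log are all assumptions.

```python
import json


class BlockTracker:
    """Toy tracker that logs every action before applying it."""

    def __init__(self, log=None):
        self.log = [] if log is None else log  # stand-in for a durable log
        self.unallocated = []  # blocks not yet assigned to a batch
        self.allocated = {}  # batch time -> list of block ids
        for entry in list(self.log):
            self._apply(json.loads(entry))  # replay to rebuild the state

    def _record(self, event):
        self.log.append(json.dumps(event))  # write ahead ...
        self._apply(event)  # ... then change the in-memory state

    def _apply(self, event):
        if event["type"] == "add":
            self.unallocated.append(event["block"])
        elif event["type"] == "allocate":
            self.allocated[event["batch"]] = self.unallocated
            self.unallocated = []

    def add_block(self, block_id):
        self._record({"type": "add", "block": block_id})

    def allocate_to_batch(self, batch_time):
        self._record({"type": "allocate", "batch": batch_time})


tracker = BlockTracker()
tracker.add_block("b1")
tracker.add_block("b2")
tracker.allocate_to_batch(1000)
tracker.add_block("b3")

# Simulate a driver restart: only the log survives.
recovered = BlockTracker(log=list(tracker.log))
```

After the simulated restart, `recovered` holds the same batch-to-block allocations and the same unallocated blocks as the original tracker.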
    
    Furthermore, the `ReceiverInputDStream` was modified to create
    `WriteAheadLogBackedBlockRDD`s when file segment info is present in the
    `ReceivedBlockInfo`. After recovery of all the block info (through recovery
    of the `ReceivedBlockTracker`), the `WriteAheadLogBackedBlockRDD`s get
    recreated with the recovered info, and jobs are submitted. The data of the
    blocks gets pulled from the write ahead logs, thanks to the segment info
    present in the `ReceivedBlockInfo`.
    
    This is still a WIP. Things that are missing here are:
    
    - *End-to-end integration tests:* Unit tests that test driver recovery by
    killing and restarting the streaming context and verifying that all the
    input data gets processed. This has been implemented but not included in
    this PR yet. A sneak peek of that DriverFailureSuite can be found in this PR
    (on my personal repo): https://github.com/tdas/spark/pull/25 I can either
    include it in this PR, or submit that as a separate PR after this gets in.
    
    - *WAL cleanup:* Cleaning up the received data write ahead log, by calling 
`ReceivedBlockHandler.cleanupOldBlocks`. This is being worked on.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #3026 from tdas/driver-ha-rbt and squashes the following commits:
    
    a8009ed [Tathagata Das] Added comment
    1d704bb [Tathagata Das] Enabled storing recovered WAL-backed blocks to BM
    2ee2484 [Tathagata Das] More minor changes based on PR
    47fc1e3 [Tathagata Das] Addressed PR comments.
    9a7e3e4 [Tathagata Das] Refactored ReceivedBlockTracker API a bit to make 
things a little cleaner for users of the tracker.
    af63655 [Tathagata Das] Minor changes.
    fce2b21 [Tathagata Das] Removed commented lines
    59496d3 [Tathagata Das] Changed class names, made allocation more explicit 
and added cleanup
    19aec7d [Tathagata Das] Fixed casting bug.
    f66d277 [Tathagata Das] Fix line lengths.
    cda62ee [Tathagata Das] Added license
    25611d6 [Tathagata Das] Minor changes before submitting PR
    7ae0a7fb [Tathagata Das] Transferred changes from driver-ha-working branch
    
    (cherry picked from commit 5f13759d3642ea5b58c12a756e7125ac19aff10e)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 9cba88c7f9fdf151217716e4cc5fa75995736922
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2014-11-05T18:33:13Z

    [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, 
Java
    
    ### Summary
    
    * Made it easier to construct default Strategy and BoostingStrategy and to 
set parameters using simple types.
    * Added Scala and Java examples for GradientBoostedTrees
    * small cleanups and fixes
    
    ### Details
    
    GradientBoosting bug fixes ("bug" = bad default options)
    * Force boostingStrategy.weakLearnerParams.algo = Regression
    * Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
    * Only persist data if not yet persisted (since it causes an error if 
persisted twice)
    
    BoostingStrategy
    * numEstimators: renamed to numIterations
    * removed subsamplingRate (duplicated by Strategy)
    * removed categoricalFeaturesInfo since it belongs with the weak learner 
params (since boosting can be oblivious to feature type)
    * Changed algo to var (not val) and added BeanProperty, with overload 
taking String argument
    * Added assertValid() method
    * Updated defaultParams() method and eliminated defaultWeakLearnerParams() 
since that belongs in Strategy
    
    Strategy (for DecisionTree)
    * Changed algo to var (not val) and added BeanProperty, with overload 
taking String argument
    * Added setCategoricalFeaturesInfo method taking Java Map.
    * Cleaned up assertValid
    * Changed val's to def's since parameters can now be changed.
    
    CC: manishamde mengxr codedeft
    
    Author: Joseph K. Bradley <jos...@databricks.com>
    
    Closes #3094 from jkbradley/gbt-api and squashes the following commits:
    
    7a27e22 [Joseph K. Bradley] scalastyle fix
    52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into gbt-api
    e9b8410 [Joseph K. Bradley] Summary of changes
    
    (cherry picked from commit 5b3b6f6f5f029164d7749366506e142b104c1d43)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 236434033fe452e70dbd0236935a49693712e130
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-11-05T00:15:38Z

    [SPARK-2938] Support SASL authentication in NettyBlockTransferService
    
    Also lays the groundwork for supporting it inside the external shuffle 
service.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #3087 from aarondav/sasl and squashes the following commits:
    
    3481718 [Aaron Davidson] Delete rogue println
    44f8410 [Aaron Davidson] Delete documentation - muahaha!
    eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at 
Spark-level
    a6b95f1 [Aaron Davidson] Address comments
    785bbde [Aaron Davidson] Cleanup
    79973cb [Aaron Davidson] Remove unused file
    151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
    f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
    7b42adb [Aaron Davidson] Add unit tests
    8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in 
NettyBlockTransferService

commit e7f735637ad2f681b454d1297f6fdcc433feebbc
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-11-05T22:38:43Z

    [SPARK-4242] [Core] Add SASL to external shuffle service
    
    Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts 
SecurityManager in BlockManager's constructor, and (3) adds unit test.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #3108 from aarondav/sasl-client and squashes the following commits:
    
    48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
    3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
    b58518a [Aaron Davidson] ByteStreams.limit() not available :(
    cbe451a [Aaron Davidson] Address comments
    2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle 
service

commit 866c7bbe56f9c7fd96d3f4afe8a76405dc877a6e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2014-11-04T02:18:47Z

    [SPARK-611] Display executor thread dumps in web UI
    
    This patch allows executor thread dumps to be collected on-demand and 
viewed in the Spark web UI.
    
    The thread dumps are collected using Thread.getAllStackTraces().  To allow 
remote thread dumps to be triggered from the web UI, I added a new 
`ExecutorActor` that runs inside of the Executor actor system and responds to 
RPCs from the driver.  The driver's mechanism for obtaining a reference to this 
actor is a little bit hacky: it uses the block manager master actor to 
determine the host/port of the executor actor systems in order to construct 
ActorRefs to ExecutorActor.  Unfortunately, I couldn't find a much cleaner way 
to do this without a big refactoring of the executor -> driver communication.
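
On-demand collection of stack traces can be illustrated outside the JVM as well. The sketch below is a rough Python analog of `Thread.getAllStackTraces()` built on `sys._current_frames()`; it is not part of the patch, and `thread_dump` is a hypothetical name.

```python
import sys
import threading
import traceback


def thread_dump():
    """Return {thread id: (thread name, formatted stack)} for live threads."""
    names = {t.ident: t.name for t in threading.enumerate()}
    dump = {}
    for ident, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        dump[ident] = (names.get(ident, "unknown"), stack)
    return dump


dump = thread_dump()
current_name, current_stack = dump[threading.get_ident()]
```

A remote trigger, as described above, would call such a function inside the executor process and send the result back to the driver.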
    
    Screenshots:
    
    
![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
    
    
![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
    
    
![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following 
commits:
    
    3c21a5d [Josh Rosen] Address review comments:
    880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into 
jstack-in-web-ui
    f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into 
jstack-in-web-ui
    19707b0 [Josh Rosen] Add one comment.
    127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
    b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into 
jstack-in-web-ui
    3dfc2d4 [Josh Rosen] Add missing file.
    bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
    f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
    dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
    4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
    2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in 
non-local mode.
    cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
    87b8b65 [Josh Rosen] Add new listener event for thread dumps.
    8c10216 [Josh Rosen] Add missing file.
    0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI

commit 7517c37aee373c8bd3ccbf1eae079b0fc6b89c91
Author: Zhang, Liye <liye.zh...@intel.com>
Date:   2014-11-04T02:17:32Z

    [SPARK-4168][WebUI] web statges number should show correctly when stages 
are more than 1000
    
    The numbers of completed and failed stages shown on the web UI will always
    be less than 1000. This is misleading when thousands of stages have already
    completed or failed. The numbers should be correct even when only some of
    the stages are listed on the web UI (stage info is removed if the number is
    too large).
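
The general remedy is to count events separately from the bounded list of retained entries. A minimal Python sketch, with a hypothetical class name and a retention limit of 1000 assumed for illustration:

```python
from collections import deque


class StageProgress:
    """Keep only recent stage info, but count every completed stage."""

    def __init__(self, retained=1000):
        self.completed = deque(maxlen=retained)  # oldest entries are dropped
        self.num_completed = 0  # never trimmed

    def on_stage_completed(self, stage_info):
        self.completed.append(stage_info)
        self.num_completed += 1


progress = StageProgress()
for stage_id in range(2500):
    progress.on_stage_completed({"id": stage_id})
```

The UI can then list `completed` while reporting `num_completed` as the total.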
    
    Author: Zhang, Liye <liye.zh...@intel.com>
    
    Closes #3035 from liyezhang556520/webStageNum and squashes the following 
commits:
    
    d9e29fb [Zhang, Liye] add detailed comments for variables
    4ea8fd1 [Zhang, Liye] change variable name accroding to comments
    f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show 
correctly when stages are more than 1000

commit e0a043b79c250515a680485f0dc7b1a149835445
Author: zsxwing <zsxw...@gmail.com>
Date:   2014-11-04T06:40:43Z

    [SPARK-4163][Core] Add a backward compatibility test for FetchFailed
    
    /cc aarondav
    
    Author: zsxwing <zsxw...@gmail.com>
    
    Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following 
commits:
    
    21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed

commit 68be37b823516dbeda066776bb060bf894db4e95
Author: zsxwing <zsxw...@gmail.com>
Date:   2014-11-04T06:47:45Z

    [SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailure
    
    Author: zsxwing <zsxw...@gmail.com>
    
    Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following 
commits:
    
    89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure

commit b27d7dcaaad0bf04d341660ffbeb742cd4eecfd3
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
Date:   2014-11-03T17:02:35Z

    [EC2] Factor out Mesos spark-ec2 branch
    
    We reference a specific branch in two places. This patch makes it one place.
    
    Author: Nicholas Chammas <nicholas.cham...@gmail.com>
    
    Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the 
following commits:
    
    10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch

commit f4beb77f083e477845b90b5049186095d2002f49
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date:   2014-11-05T23:30:31Z

    [SPARK-3984] [SPARK-3983] Fix incorrect scheduler delay and display task 
deserialization time in UI
    
    This commit fixes the scheduler delay in the UI (which previously
    included things that are not scheduler delay, like time to
    deserialize the task and serialize the result), and also
    adds information about time to deserialize tasks to the optional
    additional metrics.  Time to deserialize the task can be large relative
    to task time for short jobs, and understanding when it is high can help
    developers realize that they should try to reduce closure size (e.g., by
    including less data in the task description).
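
The corrected accounting can be written as simple arithmetic. The function and parameter names below are hypothetical and the exact set of subtracted terms in the Spark UI may differ; the sketch only shows that deserialization and result serialization are no longer counted as scheduler delay.

```python
def scheduler_delay(total_ms, executor_run_ms, deserialize_ms, result_serialize_ms):
    """Time not explained by running, deserializing or serializing the task."""
    overhead_ms = deserialize_ms + result_serialize_ms
    return max(0, total_ms - executor_run_ms - overhead_ms)


# A 1000 ms task: 700 ms running, 150 ms deserializing, 50 ms serializing.
delay = scheduler_delay(1000, 700, 150, 50)
```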
    
    cc shivaram etrain
    
    Author: Kay Ousterhout <kayousterh...@gmail.com>
    
    Closes #2832 from kayousterhout/SPARK-3983 and squashes the following 
commits:
    
    0c1398e [Kay Ousterhout] Fixed ordering
    531575d [Kay Ousterhout] Removed executor launch time
    1f13afe [Kay Ousterhout] Minor spacing fixes
    335be4b [Kay Ousterhout] Made metrics hideable
    5bc3cba [Kay Ousterhout] [SPARK-3984] [SPARK-3983] Improve UI task metrics.
    
    (cherry picked from commit a46497eecc50f854c5c5701dc2b8a2468b76c085)
    Signed-off-by: Kay Ousterhout <kayousterh...@gmail.com>
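
To illustrate the scheduler-delay fix described in the commit above, here is a minimal sketch of the arithmetic. It is not the Spark UI code: the function and parameter names are hypothetical, the "old" formula is only a rough model of the earlier behaviour, and any other components of task time are ignored.

```python
def scheduler_delay_ms(launch_ms, finish_ms, run_ms,
                       deserialize_ms, result_serialize_ms):
    """Delay left over once executor-side work is subtracted.

    All values are in milliseconds; the parameter names are illustrative.
    """
    total_ms = finish_ms - launch_ms
    # Task deserialization and result serialization are executor overhead,
    # not scheduler delay, so they are subtracted as well.
    overhead_ms = deserialize_ms + result_serialize_ms
    # Clamp at zero so clock skew or rounding cannot yield a negative delay.
    return max(0, total_ms - run_ms - overhead_ms)


def old_scheduler_delay_ms(launch_ms, finish_ms, run_ms):
    """Rough model of the earlier behaviour: overhead was lumped in."""
    return max(0, (finish_ms - launch_ms) - run_ms)
```

For a task that takes 1000 ms wall-clock, runs for 700 ms, and spends 150 ms deserializing and 50 ms serializing its result, the earlier model reports 300 ms of scheduler delay and the corrected one reports 100 ms.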

commit 6844e7a8219ac78790a422ffd5054924e7d2bea1
Author: industrial-sloth <industrial-sl...@users.noreply.github.com>
Date:   2014-11-05T23:38:48Z

    SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReader
    
    replaces the existing read() call with readFully().
    
    Author: industrial-sloth <industrial-sl...@users.noreply.github.com>
    
    Closes #3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits:
    
    a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader
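
The reason for preferring readFully() over read() in the commit above is that a single read() may return fewer bytes than requested, which would truncate a fixed-length record. The sketch below shows the same idea in Python. `ShortReadStream` and `read_fully` are hypothetical names made up for this illustration; they are not part of Spark or Hadoop.

```python
import io


class ShortReadStream:
    """Hypothetical stream that hands back at most `chunk` bytes per read(),
    imitating a file system that delivers short reads."""

    def __init__(self, data, chunk):
        self._buf = io.BytesIO(data)
        self._chunk = chunk

    def read(self, size):
        return self._buf.read(min(size, self._chunk))


def read_fully(stream, n):
    """Keep reading until exactly n bytes have arrived, or raise EOFError."""
    parts = []
    remaining = n
    while remaining > 0:
        piece = stream.read(remaining)
        if not piece:
            raise EOFError("stream ended with %d bytes still missing" % remaining)
        parts.append(piece)
        remaining -= len(piece)
    return b"".join(parts)


record_length = 8
# A single read() only yields the first chunk of the record ...
partial = ShortReadStream(b"abcdefgh", chunk=3).read(record_length)
# ... while read_fully() loops until the whole record is assembled.
full = read_fully(ShortReadStream(b"abcdefgh", chunk=3), record_length)
```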

commit cf2f676f93807bc504b77409b6c3d66f0d5e38ab
Author: Andrew Or <and...@databricks.com>
Date:   2014-11-05T23:42:05Z

    [SPARK-3797] Run external shuffle service in Yarn NM
    
    This creates a new module `network/yarn` that depends on `network/shuffle`
    recently created in #3001. This PR introduces a custom Yarn auxiliary service
    that runs the external shuffle service. As of the changes here this shuffle
    service is required for using dynamic allocation with Spark.
    
    This is still WIP mainly because it doesn't handle security yet. I have
    tested this on a stable Yarn cluster.
    
    Author: Andrew Or <and...@databricks.com>
    
    Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
    
    ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
    0ee67a2 [Andrew Or] Minor wording suggestions
    1c66046 [Andrew Or] Remove unused provided dependencies
    0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
    6489db5 [Andrew Or] Try catch at the right places
    7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
    d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
    5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
    9b6e058 [Andrew Or] Address various feedback
    f48b20c [Andrew Or] Fix tests again
    f39daa6 [Andrew Or] Do not make network-yarn an assembly module
    761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
    15a5b37 [Andrew Or] Fix build for Hadoop 1.x
    baff916 [Andrew Or] Fix tests
    5bf9b7e [Andrew Or] Address a few minor comments
    5b419b8 [Andrew Or] Add missing license header
    804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
    cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
    ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
    1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
    b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
    43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
    b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
    
    (cherry picked from commit 61a5cced049a8056292ba94f23fa7bd040f50685)
    Signed-off-by: Andrew Or <and...@databricks.com>
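
The commit above states that the external shuffle service is required for dynamic allocation. A minimal sketch of such a precondition check follows. It assumes the settings are held in a plain dict keyed by `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled`; the function name and error text are hypothetical and do not reproduce Spark's own validation code.

```python
def validate_dynamic_allocation(conf):
    """Raise ValueError if dynamic allocation is enabled without the
    external shuffle service. `conf` is a dict of string settings."""

    def is_true(key):
        return conf.get(key, "false").strip().lower() == "true"

    if is_true("spark.dynamicAllocation.enabled") and \
            not is_true("spark.shuffle.service.enabled"):
        raise ValueError(
            "dynamic allocation needs the external shuffle service enabled")


# Passes silently: both settings are enabled together.
validate_dynamic_allocation({
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
})
```

The check only fires when dynamic allocation is switched on, so configurations that do not use it are unaffected.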

commit fe4ead2995ab8529602090ed21941b6005a07c9d
Author: j...@apache.org <jayunit100>
Date:   2014-11-05T23:45:34Z

    SPARK-4040. Update documentation to exemplify use of local (n) value, fo...
    
    This is a minor docs update which helps to clarify the way local[n] is used
    for streaming apps.
    
    Author: j...@apache.org <jayunit100>
    
    Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits:
    
    35b5a5e [j...@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
    
    (cherry picked from commit 868cd4c3ca11e6ecc4425b972d9a20c360b52425)
    Signed-off-by: Matei Zaharia <ma...@databricks.com>
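
As background for the local[n] documentation change above: as I understand the Spark Streaming guide, each receiver occupies one thread, so a local master needs more threads than receivers or no thread is left to process the data. The sketch below encodes that rule of thumb. Both helper names are hypothetical and are not Spark APIs.

```python
import os
import re


def local_thread_count(master):
    """Return the thread count implied by a local master string."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        raise ValueError("not a local master string: %r" % master)
    value = match.group(1)
    # local[*] means one thread per available core.
    return (os.cpu_count() or 1) if value == "*" else int(value)


def leaves_thread_for_processing(master, num_receivers):
    """True if at least one thread remains after the receivers take theirs."""
    return local_thread_count(master) > num_receivers
```

With one receiver, `local[1]` leaves nothing for processing, while `local[2]` leaves one thread free.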

----

