[GitHub] spark pull request: Branch 1.1

huozhanfeng Wed, 06 Aug 2014 22:23:23 -0700

GitHub user huozhanfeng opened a pull request:

    https://github.com/apache/spark/pull/1824


    Branch 1.1

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1824.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1824
    
----
commit e22110879cd149e94c9a5ca7466f787033572b15
Author: Andrew Or <[email protected]>
Date:   2014-08-02T19:11:50Z

    [HOTFIX] Do not throw NPE if spark.test.home is not set
    
    `spark.test.home` was introduced in #1734. This is fine for SBT but is 
failing maven tests. Either way it shouldn't throw an NPE.
    
    Author: Andrew Or <[email protected]>
    
    Closes #1739 from andrewor14/fix-spark-test-home and squashes the following 
commits:
    
    ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set

commit 8d6ac2b95ab48d9fffe82ef04cef3b22c2c139e0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-02T20:07:17Z

    [SPARK-2478] [mllib] DecisionTree Python API
    
    Added experimental Python API for Decision Trees.
    
    API:
    * class DecisionTreeModel
    ** predict() for single examples and RDDs, taking both feature vectors and 
LabeledPoints
    ** numNodes()
    ** depth()
    ** __str__()
    * class DecisionTree
    ** trainClassifier()
    ** trainRegressor()
    ** train()
    
    Examples and testing:
    * Added example testing classification and regression with batch 
prediction: examples/src/main/python/mllib/tree.py
    * Have also tested example usage in doc of python/pyspark/mllib/tree.py 
which tests single-example prediction with dense and sparse vectors
    
    Also: Small bug fix in python/pyspark/mllib/_common.py: In 
_linear_predictor_typecheck, changed check for RDD to use isinstance() instead 
of type() in order to catch RDD subclasses.
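
As a minimal sketch of why this check matters (the `RDD` and `SchemaRDD` classes below are hypothetical stand-ins, not the pyspark classes), an exact `type()` comparison rejects subclass instances, while `isinstance()` accepts them:

```python
class RDD(object):
    """Hypothetical stand-in for an RDD base class."""


class SchemaRDD(RDD):
    """Hypothetical stand-in for an RDD subclass."""


def is_rdd_by_type(x):
    # Exact type comparison: an instance of a subclass does not match.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() follows the inheritance chain, so subclasses match too.
    return isinstance(x, RDD)
```

With these definitions, `is_rdd_by_type(SchemaRDD())` is false while `is_rdd_by_isinstance(SchemaRDD())` is true.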
    
    CC mengxr manishamde
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #1727 from jkbradley/decisiontree-python-new and squashes the 
following commits:
    
    3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py 
Small updates based on github review.
    6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in 
pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to 
threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * 
Fixed small bug in loadLibSVMFile: If a data file had no features, then 
loadLibSVMFile would create a single all-zero feature.
    67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) 
returns an empty RDD instead of an error. * Removed support for calling 
predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache 
serialized RDD any more.
    aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
    fa10ea7 [Joseph K. Bradley] Small style update
    7968692 [Joseph K. Bradley] small braces typo fix
    e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
    db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into 
decisiontree-python-new
    6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method 
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins 
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is 
the bound for unordered categorical features, not ordered ones. The upper bound 
should be the arity (i.e., max value) of the feature.
    93953f1 [Joseph K. Bradley] Likely done with Python API.
    6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
    188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example 
but need to test a bit more.
    2b20c61 [Joseph K. Bradley] Small doc and style updates
    1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
    8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-bugfix
    978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-bugfix
    6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural 
syntax for functions returning Unit to explicitly writing Unit return type.
    376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit 
scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 
* In params, replace settings of maxDepth <-- maxDepth - 1
    e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    59750f8 [Joseph K. Bradley] * Updated Strategy to check 
numClassesForClassification only if algo=Classification. * Updates based on 
comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** 
Small cleanups ** tree.Node: Made recursive helper methods private, and renamed 
them.
    52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-bugfix
    f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump 
with 2 continuous variables for binary classification.  Caused problems in 
past, but fixed now.
    8e227ea [Joseph K. Bradley] Changed Strategy so it only requires 
numClassesForClassification >= 2 for classification
    cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for 
splits for continuous features.
    8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into 
decisiontree-python-new
    5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-python-new
    2283df8 [Joseph K. Bradley] 2 bug fixes.
    73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into decisiontree-bugfix
    5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: 
Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next 
commit.
    f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  
Also added toString, depth, and numNodes methods to DecisionTreeModel.
    
    (cherry picked from commit 3f67382e7c9c3f6a8f6ce124ab3fcb1a9c1a264f)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 91de0dc1654d609dc1ff8fa9a07ba18043ad61c6
Author: Yin Huai <[email protected]>
Date:   2014-08-02T20:16:41Z

    [SQL] Set outputPartitioning of BroadcastHashJoin correctly.
    
    I think we will not generate the plan triggering this bug at this moment. 
But, let me explain it...
    
    Right now, we are using `left.outputPartitioning` as the 
`outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical 
plan for cases like...
    ```sql
    SELECT l.key, count(*)
    FROM (SELECT key, count(*) as cnt
          FROM src
          GROUP BY key) l -- This is buildPlan
    JOIN r -- This is the streamedPlan
    ON (l.cnt = r.value)
    GROUP BY l.key
    ```
    Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we 
will pick `l`'s `outputPartitioning` for the `outputPartitioning` of the 
`BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using 
`l.key` as the key, we will not introduce an `Exchange` for this aggregation. 
However, `r`'s outputPartitioning may not match the required distribution of 
the last `GROUP BY` and we fail to group data correctly.
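
The failure mode can be imitated with a toy model in plain Python. This is only a sketch under stated assumptions: the function names, the sample rows, and the list-of-lists "partitions" are invented for illustration and are not Spark's planner or execution code. The join output keeps the layout of the streamed side, so aggregating each partition without a shuffle can emit the same key more than once:

```python
from collections import Counter

# Hypothetical sample data: build side rows are (key, cnt); the streamed
# side is split into partitions of join values.
BUILD_ROWS = [("a", 1), ("b", 2)]
STREAMED_PARTITIONS = [[1, 2], [1]]


def broadcast_hash_join(build_rows, streamed_partitions):
    """Join every streamed partition against the whole build side.

    The result has one output partition per streamed partition, i.e. it
    follows the streamed side's partitioning, not the build side's.
    """
    keys_by_cnt = {}
    for key, cnt in build_rows:
        keys_by_cnt.setdefault(cnt, []).append(key)
    return [
        [key for value in part for key in keys_by_cnt.get(value, [])]
        for part in streamed_partitions
    ]


def group_without_exchange(partitions):
    # Count keys inside each partition only, as if rows were already
    # clustered by key. This is the incorrect plan.
    return [Counter(part) for part in partitions]


def group_with_exchange(partitions, num_partitions):
    # Shuffle rows by key first, so each key lands in exactly one partition.
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key in part:
            shuffled[hash(key) % num_partitions].append(key)
    return [Counter(part) for part in shuffled]
```

With the sample data, the join output is `[["a", "b"], ["a"]]`. Grouping without an exchange reports key `a` in two partitions with a count of 1 each, whereas grouping after the shuffle reports it once with a count of 2.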
    
    JIRA is being reindexed. I will create a JIRA ticket once it is back online.
    
    Author: Yin Huai <[email protected]>
    
    Closes #1735 from yhuai/BroadcastHashJoin and squashes the following 
commits:
    
    96d9cb3 [Yin Huai] Set outputPartitioning correctly.
    
    (cherry picked from commit 67bd8e3c217a80c3117a6e3853aa60fe13d08c91)
    Signed-off-by: Michael Armbrust <[email protected]>

commit bb0ac6d7c91c491a99c252e6cb4aea40efe9b190
Author: Chris Fregly <[email protected]>
Date:   2014-08-02T20:35:35Z

    [SPARK-1981] Add AWS Kinesis streaming support
    
    Author: Chris Fregly <[email protected]>
    
    Closes #1434 from cfregly/master and squashes the following commits:
    
    4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be 
more clear, removed retries around store() method
    0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back 
into extras/kinesis-asl
    691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with 
JavaKinesisWordCount during union of streams
    0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, 
updated docs
    e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples 
into the examples/ dir
    d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the 
KinesisUtils api
    912640c [Chris Fregly] changed the foundKinesis class to be a 
publically-avail class
    db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and 
kinesis client
    338997e [Chris Fregly] improve build docs for kinesis
    828f8ae [Chris Fregly] more cleanup
    e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    cd68c0d [Chris Fregly] fixed typos and backward compatibility
    d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
    
    (cherry picked from commit 91f9504e6086fac05b40545099f9818949c24bca)
    Signed-off-by: Tathagata Das <[email protected]>

commit 7924d72cf8aae945d72f355c54c4fcb3d62e6c48
Author: GuoQiang Li <[email protected]>
Date:   2014-08-02T20:55:28Z

    SPARK-2804: Remove scalalogging-slf4j dependency
    
    This also Closes #1701.
    
    Author: GuoQiang Li <[email protected]>
    
    Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
    
    422646b [GuoQiang Li] Remove scalalogging-slf4j dependency

commit 3b9f25f4259b254f3faa2a7d61e547089a69c259
Author: Michael Armbrust <[email protected]>
Date:   2014-08-02T23:33:48Z

    [SPARK-2097][SQL] UDF Support
    
    This patch adds the ability to register lambda functions written in Python, 
Java or Scala as UDFs for use in SQL or HiveQL.
    
    Scala:
    ```scala
    registerFunction("strLenScala", (_: String).length)
    sql("SELECT strLenScala('test')")
    ```
    Python:
    ```python
    sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
    sqlCtx.sql("SELECT strLenPython('test')")
    ```
    Java:
    ```java
    sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
      @Override
      public Integer call(String str) throws Exception {
        return str.length();
      }
    }, DataType.IntegerType);
    
    sqlContext.sql("SELECT stringLengthJava('test')");
    ```
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #1063 from marmbrus/udfs and squashes the following commits:
    
    9eda0fe [Michael Armbrust] newline
    747c05e [Michael Armbrust] Add some scala UDF tests.
    d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' 
into udfs
    005d684 [Michael Armbrust] Fix naming and formatting.
    d14dac8 [Michael Armbrust] Fix last line of autogened java files.
    8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
    40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' 
into udfs
    6a36890 [Michael Armbrust] Switch logging so that SQLContext can be 
serializable.
    7a83101 [Michael Armbrust] Drop toString
    795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
    e54fb45 [Michael Armbrust] Docs and tests.
    437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, 
address review comments.
    01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' 
into udfs
    8e6c932 [Michael Armbrust] WIP
    3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' 
into udfs
    6237c8d [Michael Armbrust] WIP
    2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support 
for Java UDFs.
    0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and 
Python.
    
    (cherry picked from commit 158ad0bba9382fd494b4789b5628a9cec00cfa19)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 4230df4e1d6c59dc3405f46f5edf18c3825a5447
Author: Michael Armbrust <[email protected]>
Date:   2014-08-02T23:48:07Z

    [SPARK-2785][SQL] Remove assertions that throw when users try unsupported 
Hive commands.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #1742 from marmbrus/asserts and squashes the following commits:
    
    5182d54 [Michael Armbrust] Remove assertions that throw when users try 
unsupported Hive commands.
    
    (cherry picked from commit 198df11f1a9f419f820f47eba0e9f2ab371a824b)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 460fad817da1fb6619d2456f637c1b7c7f5e8c7c
Author: Cheng Lian <[email protected]>
Date:   2014-08-03T00:12:49Z

    [SPARK-2729][SQL] Added test case for SPARK-2729
    
    This is a follow up of #1636.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #1738 from liancheng/test-for-spark-2729 and squashes the following 
commits:
    
    b13692a [Cheng Lian] Added test case for SPARK-2729
    
    (cherry picked from commit 866cf1f822cfda22294054be026ef2d96307eb75)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 5ef828273deb4713a49700c56d51bdd980917cfd
Author: Yin Huai <[email protected]>
Date:   2014-08-03T00:55:22Z

    [SPARK-2797] [SQL] SchemaRDDs don't support unpersist()
    
    The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
    
    Author: Yin Huai <[email protected]>
    
    Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
    
    7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called 
without the input parameter (blocking) from PySpark.
    
    (cherry picked from commit d210022e96804e59e42ab902e53637e50884a9ab)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 5b30e001839a29e6c4bd1fc24bfa12d9166ef10c
Author: Michael Armbrust <[email protected]>
Date:   2014-08-03T01:27:04Z

    [SPARK-2739][SQL] Rename registerAsTable to registerTempTable
    
    There have been user complaints that the difference between 
`registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by 
renaming `registerAsTable` to `registerTempTable`, which more clearly reflects 
what is happening.  `registerAsTable` remains, but will cause a deprecation 
warning.
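
The rename-with-deprecation pattern described here can be sketched in Python as follows. `TableRegistry` is a hypothetical stand-in class, not Spark's SchemaRDD; only the two method names come from the commit message:

```python
import warnings


class TableRegistry(object):
    """Hypothetical stand-in that only records registered table names."""

    def __init__(self):
        self.temp_tables = []

    def registerTempTable(self, name):
        # The new, clearer name does the actual work.
        self.temp_tables.append(name)

    def registerAsTable(self, name):
        # The old name remains, warns, and delegates to the new one.
        warnings.warn(
            "registerAsTable is deprecated; use registerTempTable instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.registerTempTable(name)
```

Existing callers of the old name keep working and see a warning that points at their own call site (via `stacklevel=2`).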
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #1743 from marmbrus/registerTempTable and squashes the following 
commits:
    
    d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' 
into registerTempTable
    4dff086 [Michael Armbrust] Fix .java files too
    89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' 
into registerTempTable
    0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
    
    (cherry picked from commit 1a8043739dc1d9435def6ea3c6341498ba52b708)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 0d47bb642f645c3c8663f4bdf869b5337ef9cb35
Author: Sean Owen <[email protected]>
Date:   2014-08-03T04:44:19Z

    SPARK-2602 [BUILD] Tests steal focus under Java 6
    
    As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be 
resolved for Java 6 with the java.awt.headless system property, which never 
hurt anyone running a command line app. I tested it, and it seemed to get rid of 
focus stealing.
    
    Author: Sean Owen <[email protected]>
    
    Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
    
    b141018 [Sean Owen] Set java.awt.headless during tests
    (cherry picked from commit 33f167d762483b55d5d874dcc1e3075f661d4375)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit c137928cbe74446254fdbd656c50c1a1c8930094
Author: Sean Owen <[email protected]>
Date:   2014-08-03T04:55:56Z

    SPARK-2414 [BUILD] Add LICENSE entry for jquery
    
    The JIRA concerned removing jquery, and this does not remove jquery. While 
it is distributed by Spark it should have an accompanying line in LICENSE, very 
technically, as per http://www.apache.org/dev/licensing-howto.html
    
    Author: Sean Owen <[email protected]>
    
    Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
    
    2fdb03c [Sean Owen] Add LICENSE entry for jquery
    (cherry picked from commit 9cf429aaf529e91f619910c33cfe46bf33a66982)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit fb2a2079fa10ea8f338d68945a94238dda9fbd66
Author: Andrew Or <[email protected]>
Date:   2014-08-03T05:00:46Z

    [Minor] Fixes on top of #1679
    
    Minor fixes on top of #1679.
    
    Author: Andrew Or <[email protected]>
    
    Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
    
    3b46f5e [Andrew Or] Minor fixes
    (cherry picked from commit 3dc55fdf450b4237f7c592fce56d1467fd206366)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit 1992175fd93f0239e5a09e0b8db99ad9af7f380c
Author: Stephen Boesch <[email protected]>
Date:   2014-08-03T17:19:04Z

    SPARK-2712 - Add a small note to maven doc that mvn package must happen ...
    
    Per request by Reynold adding small note about proper sequencing of build 
then test.
    
    Author: Stephen Boesch <[email protected]>
    
    Closes #1615 from javadba/docs and squashes the following commits:
    
    6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
    5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that 
mvn package must happen before test
    (cherry picked from commit f8cd143b6b1b4d8aac87c229e5af263b0319b3ea)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit 162fc9512018e0c592b3aaa29d405f511461795a
Author: Allan Douglas R. de Oliveira <[email protected]>
Date:   2014-08-03T17:25:59Z

    SPARK-2246: Add user-data option to EC2 scripts
    
    Author: Allan Douglas R. de Oliveira <[email protected]>
    
    Closes #1186 from douglaz/spark_ec2_user_data and squashes the following 
commits:
    
    94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
    (cherry picked from commit a0bcbc159e89be868ccc96175dbf1439461557e1)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit eaa93555a7f935b00a2f94a7fa50a12e11578bd7
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-03T17:36:52Z

    [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
    
    Bug fix: Before, when an RDD was created in Java and passed to 
DecisionTree.train(), the fake class tag caused problems.
    * Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from 
Java.
    
    Other improvements to Decision Trees for ease-of-use with Java:
    * impurity classes: Added instance() methods to help with Java interface.
    * Strategy: Added Java-friendly constructor
    --> Note: I removed quantileCalculationStrategy from the Java-friendly 
constructor since (a) it is a special class and (b) there is only 1 option 
currently.  I suspect we will redo the API before the other options are 
included.
    
    CC: mengxr
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
    
    0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead 
of JavaConversions
    519b1b7 [Joseph K. Bradley] * Organized imports in 
JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in 
DecisionTreeSuite.scala
    f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run 
DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow 
passing RDDs from Java. * impurity classes: Added instance() methods to help 
with Java interface. * Strategy: Added Java-friendly constructor ** Note: I 
removed quantileCalculationStrategy from the Java-friendly constructor since 
(a) it is a special class and (b) there is only 1 option currently.  I suspect 
we will redo the API before the other options are included.
    d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into dt-java
    320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
    13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' 
into dt-java
    f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated 
later
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method 
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins 
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is 
the bound for unordered categorical features, not ordered ones. The upper bound 
should be the arity (i.e., max value) of the feature.
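
A small worked example of the two bounds named in commit 225822f, written as a sketch: the function names are hypothetical, and only the two formulas come from the commit message.

```python
def unordered_upper_bound(arity):
    # Bound quoted in the commit message for unordered categorical
    # features: 2^(arity - 1) - 1.
    return 2 ** (arity - 1) - 1


def ordered_upper_bound(arity):
    # Bound the commit message says ordered categorical features
    # should use: the arity itself.
    return arity
```

For a feature with 5 categories the unordered formula gives 15 while the ordered bound is 5, so the two diverge quickly as the arity grows.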
    
    (cherry picked from commit 2998e38a942351974da36cb619e863c6f0316e7a)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit c5ed1deba6b3f3e597554a8d0f93f402ae62fab9
Author: Michael Armbrust <[email protected]>
Date:   2014-08-03T19:28:29Z

    [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 
'spark.sql.dialect'
    
    Many users have reported being confused by the distinction between the 
`sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot 
be used to read hive tables.  In this PR I introduce a new configuration option 
`spark.sql.dialect` that picks which dialect will be used for parsing.  For 
SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` 
but can also be set to `sql`.
    
    The `hql` and `hiveql` methods continue to act the same but are now marked 
as deprecated.
    
    **This is a possibly breaking change for some users unless they set the 
dialect manually, though this is unlikely.**
    
    For example: `hiveContext.sql("SELECT 1")` will now throw a parsing 
exception by default.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #1746 from marmbrus/sqlLanguageConf and squashes the following 
commits:
    
    ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' 
into sqlLanguageConf
    20c43f8 [Michael Armbrust] override function instead of just setting the 
value
    7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config 
option, 'spark.sql.dialect'
    
    (cherry picked from commit 236dfac6769016e433b2f6517cda2d308dea74bc)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 6ffdcc61fb4825f991b754c45b807192f483a4a3
Author: Cheng Lian <[email protected]>
Date:   2014-08-03T19:34:46Z

    [SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native 
commands
    
    JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
    
    Author: Cheng Lian <[email protected]>
    
    Closes #1753 from liancheng/spark-2814 and squashes the following commits:
    
    c74a3b2 [Cheng Lian] Fixed SPARK-2814
    
    (cherry picked from commit ac33cbbf33bd1ab29bc8165c9be02fb8934b1fdf)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 7c6afdac867d52447221438ed7508123c07d17f8
Author: Yin Huai <[email protected]>
Date:   2014-08-03T21:54:41Z

    [SPARK-2783][SQL] Basic support for analyze in HiveContext
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2783
    
    Author: Yin Huai <[email protected]>
    
    Closes #1741 from yhuai/analyzeTable and squashes the following commits:
    
    7bb5f02 [Yin Huai] Use sql instead of hql.
    4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into 
analyzeTable
    e3ebcd4 [Yin Huai] Renaming.
    c170f4e [Yin Huai] Do not use getContentSummary.
    62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into 
analyzeTable
    db233a6 [Yin Huai] Trying to debug jenkins...
    fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into 
analyzeTable
    f0501f3 [Yin Huai] Fix compilation error.
    24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into 
analyzeTable
    8918140 [Yin Huai] Wording.
    23df227 [Yin Huai] Add a simple analyze method to get the size of a table 
and update the "totalSize" property of this table in the Hive metastore.
    
    (cherry picked from commit e139e2be60ef23281327744e1b3e74904dfdf63f)
    Signed-off-by: Michael Armbrust <[email protected]>

commit a4cdb77e5ee2c80967a7b6cd7370170fabe56cd2
Author: Davies Liu <[email protected]>
Date:   2014-08-03T22:52:00Z

    [SPARK-1740] [PySpark] kill the python worker
    
    Kill only the python worker related to cancelled tasks.
    
    The daemon will start a background thread to monitor all the opened sockets 
for all workers. If a socket is closed by the JVM, this thread will kill the 
corresponding worker.
    
    When a task is cancelled, the socket to its worker is closed, and the 
worker is then killed by the daemon.
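
The monitoring idea can be sketched in a few lines of Python. This is a simplified illustration, not the daemon's code: it watches a single socket per thread, and `watch_socket` and its `on_closed` callback (standing in for "kill the worker") are hypothetical names.

```python
import socket
import threading


def watch_socket(sock, on_closed):
    """Start a background thread that calls on_closed() once the peer
    closes its end of `sock`."""

    def monitor():
        try:
            # recv() returns an empty bytes object when the peer has closed.
            while sock.recv(4096):
                pass  # ignore any data; only the end of stream matters here
        except OSError:
            pass  # treat a socket error like a closed connection
        on_closed()

    thread = threading.Thread(target=monitor)
    thread.daemon = True
    thread.start()
    return thread
```

In a test, `socket.socketpair()` can play the two ends of the connection: closing one end triggers the callback registered on the other.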
    
    Author: Davies Liu <[email protected]>
    
    Closes #1643 from davies/kill and squashes the following commits:
    
    8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too 
heavy
    46ca150 [Davies Liu] address comment
    acd751c [Davies Liu] kill the worker when task is canceled
    
    (cherry picked from commit 55349f9fe81ba5af5e4a5e4908ebf174e63c6cc9)
    Signed-off-by: Josh Rosen <[email protected]>

commit 4784d24eadea2e1adf69d8fe4891bdce29188dd6
Author: Anand Avati <[email protected]>
Date:   2014-08-04T00:47:49Z

    [SPARK-2810] upgrade to scala-maven-plugin 3.2.0
    
    Needed for Scala 2.11 compiler-interface
    
    Signed-off-by: Anand Avati <[email protected]>
    
    Author: Anand Avati <[email protected]>
    
    Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the 
following commits:
    
    9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0

commit 2152e24d64d6a07cf6c550c9f13ab0231596be98
Author: Sarah Gerweck <[email protected]>
Date:   2014-08-04T02:47:05Z

    Fix some bugs with spaces in directory name.
    
    Any time you use the directory name (`FWDIR`) it needs to be surrounded
    in quotes. If you're also using wildcards, you can safely put the quotes
    around just `$FWDIR`.
    
    Author: Sarah Gerweck <[email protected]>
    
    Closes #1756 from sarahgerweck/folderSpaces and squashes the following 
commits:
    
    732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
    (cherry picked from commit 5507dd8e18fbb52d5e0c64a767103b2418cb09c6)
    
    Signed-off-by: Patrick Wendell <[email protected]>

commit 9aa14598f89bb8b908222e37f965178d39c34fe6
Author: DB Tsai <[email protected]>
Date:   2014-08-04T04:39:21Z

    SPARK-2272 [MLlib] Feature scaling which standardizes the range of 
independent variables or features of data
    
    Feature scaling is a method used to standardize the range of independent 
variables or features of data. In data processing, it is generally performed 
during the data preprocessing step.
    
    In this work, a trait called `VectorTransformer` is defined for generic 
transformation on a vector. It contains one method to be implemented, 
`transform` which applies transformation on a vector.
    
    There are two implementations of `VectorTransformer` now, and both can 
be easily extended with PMML transformation support.
    
    1) `StandardScaler` - Standardizes features by removing the mean and 
scaling to unit variance using column summary statistics on the samples in the 
training set.
    
    2) `Normalizer` - Normalizes samples individually to unit L^n norm
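
The two transformations can be sketched in plain Python. These helpers are illustrative only and are not the MLlib API; the use of the sample (n - 1) variance and the parameter name `p` for the norm are assumptions.

```python
import math


def standard_scale(rows):
    """Center each column to zero mean and scale it to unit variance.

    Assumes at least two rows and uses the sample (n - 1) variance.
    """
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(dims)
    ]
    # A constant column has zero spread; map it to 0.0 rather than divide by 0.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
         for j in range(dims)]
        for row in rows
    ]


def normalize(vec, p=2.0):
    """Scale a single sample so that its L^p norm is 1."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec] if norm > 0 else list(vec)
```

Note the difference in scope: `standard_scale` needs statistics over the whole data set, while `normalize` works on one vector at a time.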
    
    Author: DB Tsai <[email protected]>
    
    Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following 
commits:
    
    78c15d3 [DB Tsai] Alpine Data Labs
    
    (cherry picked from commit ae58aea2d1435b5bb011e68127e1bcddc2edf5b2)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 3823f6d25e2a89ca1bfa62a76f6e708c2c63f064
Author: Liquan Pei <[email protected]>
Date:   2014-08-04T06:55:58Z

    [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
    
    This is a pull request regarding SPARK-2510 at 
https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector 
representation of words in a text corpus. The algorithm first constructs a 
vocabulary from the corpus and then learns vector representation of words in 
the vocabulary. The vector representation can be used as features in natural 
language processing and machine learning algorithms.
    
    To make our implementation more scalable, we train each partition 
separately and merge the model of each partition after each iteration. To make 
the model more accurate, multiple iterations may be needed.
    
    One way to investigate the vector representations is to find the closest 
words to a query word. For example, with 1 partition and 1 iteration, the top 
20 closest words to "china" are:
    
    taiwan 0.8077646146334014
    korea 0.740913304563621
    japan 0.7240667798885471
    republic 0.7107151279078352
    thailand 0.6953217332072862
    tibet 0.6916782118129544
    mongolia 0.6800858715972612
    macau 0.6794925677480378
    singapore 0.6594048695593799
    manchuria 0.658989931844148
    laos 0.6512978726001666
    nepal 0.6380792327845325
    mainland 0.6365469459587788
    myanmar 0.6358614338840394
    macedonia 0.6322366180313249
    xinjiang 0.6285291551708028
    russia 0.6279951236068411
    india 0.6272874944023487
    shanghai 0.6234544135576999
    macao 0.6220588462925876
    
    The result with 10 partitions and 5 iterations is:
    taiwan 0.8310495079388313
    india 0.7737171315919039
    japan 0.756777901233668
    korea 0.7429767187102452
    indonesia 0.7407557427278356
    pakistan 0.712883426985585
    mainland 0.7053379963140822
    thailand 0.696298191073948
    mongolia 0.693690656871415
    laos 0.6913069680735292
    macau 0.6903427690029617
    republic 0.6766381604813666
    malaysia 0.676460699141784
    singapore 0.6728790997360923
    malaya 0.672345232966194
    manchuria 0.6703732292753156
    macedonia 0.6637955686322028
    myanmar 0.6589462882439646
    kazakhstan 0.657017801081494
    cambodia 0.6542383836451932
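
    A ranking like the two lists above can be sketched as follows, assuming 
the scores are cosine similarities between word vectors. The helper names and 
the toy vectors are made up for illustration and are not taken from the 
trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(vectors, query, k):
    """Return the k words most similar to query, best match first."""
    q = vectors[query]
    scored = [(word, cosine(q, vec))
              for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors, purely illustrative.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
top = closest_words(toy_vectors, "a", 2)
```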
    
    Author: Liquan Pei <[email protected]>
    Author: Xiangrui Meng <[email protected]>
    Author: Liquan Pei <[email protected]>
    
    Closes #1719 from Ishiihara/master and squashes the following commits:
    
    2ba9483 [Liquan Pei] minor fix for Word2Vec test
    e248441 [Liquan Pei] minor style change
    26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
    c14da41 [Xiangrui Meng] fix styles
    384c771 [Xiangrui Meng] remove minCount and window from constructor change 
model to use float instead of double
    e93e726 [Liquan Pei] use treeAggregate instead of aggregate
    1a8fb41 [Liquan Pei] use weighted sum in combOp
    7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
    6bcc8be [Liquan Pei] add multiple iteration support
    720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
    2e92b59 [Liquan Pei] modify according to feedback
    57dc50d [Liquan Pei] code formatting
    e4a04d3 [Liquan Pei] minor fix
    0aafb1b [Liquan Pei] Add comments, minor fixes
    8d6befe [Liquan Pei] initial commit
    
    (cherry picked from commit e053c55819363fab7068bb9165e3379f0c2f570c)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit bfd2f39581d958d5aafaa76994f44213bcdfbb69
Author: Davies Liu <[email protected]>
Date:   2014-08-04T19:13:41Z

    [SPARK-1687] [PySpark] pickable namedtuple
    
    Add a hook that replaces the original namedtuple with a picklable one, so 
that namedtuple can be used in RDDs.
    
    PS: pyspark should be imported BEFORE "from collections import namedtuple"
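
    The general idea can be sketched like this. It is an illustration only, 
not the code in this change: make_picklable, _restore, and the cache are 
hypothetical names. Each instance gets a __reduce__ that records the class 
name, the field names, and the values, so the receiving side can rebuild the 
class instead of having to import it.

```python
import collections
import pickle

_restored_classes = {}  # hypothetical cache: (name, fields) -> rebuilt class


def _restore(name, fields, values):
    """Rebuild a namedtuple instance from its class name, fields, and values."""
    key = (name, fields)
    cls = _restored_classes.get(key)
    if cls is None:
        cls = collections.namedtuple(name, fields)
        _restored_classes[key] = cls
    return cls(*values)


def make_picklable(cls):
    """Attach a __reduce__ so instances carry their own class description."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


def roundtrip(obj):
    """Pickle and unpickle obj; _restore must be importable on the reader."""
    return pickle.loads(pickle.dumps(obj))


Point = make_picklable(collections.namedtuple("Point", "x y"))
```

    With this in place, the pickled form of Point(1, 2) refers to _restore 
rather than to the Point class itself, which is what lets a worker that never 
defined Point rebuild an equivalent tuple.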
    
    Author: Davies Liu <[email protected]>
    
    Closes #1623 from davies/namedtuple and squashes the following commits:
    
    045dad8 [Davies Liu] remove unrelated code changes
    4132f32 [Davies Liu] address comment
    55b1c1a [Davies Liu] fix tests
    61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked 
one
    98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
namedtuple
    f7b1bde [Davies Liu] add hack for CloudPickleSerializer
    0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
namedtuple
    21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
    93b03b8 [Davies Liu] pickable namedtuple
    
    (cherry picked from commit 59f84a9531f7974a053fd4963ce9afd88273ea4c)
    Signed-off-by: Josh Rosen <[email protected]>

commit aa7a48ee905b95e57f64051ea887d4775b427603
Author: Matei Zaharia <[email protected]>
Date:   2014-08-04T19:59:18Z

    SPARK-2792. Fix reading too much or too little data from each stream in 
ExternalMap / Sorter
    
    All these changes are from mridulm's work in #1609, but extracted here to 
fix this specific issue and make it easier to merge into 1.1. This particular 
set of changes is to make sure that we read exactly the right range of bytes 
from each spill file in EAOM: some serializers can write bytes after the last 
object (e.g. the TC_RESET flag in Java serialization) and that would confuse 
the previous code into reading it as part of the next batch. There are also 
improvements to cleanup to make sure files are closed.
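
    The byte-range idea can be sketched as follows. This is an illustration 
with Python's pickle and an in-memory stream, not the Scala code in this 
change; the function names, the one-byte trailer, and the (size, count) index 
are assumptions. The writer records how many bytes each batch occupies, and 
the reader hands the deserializer a view bounded to exactly that range, so 
stray trailing bytes are never read as part of the next batch.

```python
import io
import pickle


def write_batches(stream, batches, trailer=b"\x00"):
    """Write each batch, then a trailer; return (byte size, count) per batch.

    The trailer stands in for bytes a serializer may emit after the last
    object of a batch.
    """
    index = []
    for batch in batches:
        start = stream.tell()
        for obj in batch:
            pickle.dump(obj, stream)
        stream.write(trailer)
        index.append((stream.tell() - start, len(batch)))
    return index


def read_batches(stream, index):
    """Read every batch back from its own bounded slice of the stream."""
    result = []
    for size, count in index:
        chunk = io.BytesIO(stream.read(size))  # exactly this batch's bytes
        result.append([pickle.load(chunk) for _ in range(count)])
    return result


spill = io.BytesIO()
batch_index = write_batches(spill, [[1, 2], ["a"]])
spill.seek(0)
recovered = read_batches(spill, batch_index)
```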
    
    In addition to bringing in the changes to ExternalAppendOnlyMap, I also 
copied them to the corresponding code in ExternalSorter and updated its test 
suite to test for the same issues.
    
    Author: Matei Zaharia <[email protected]>
    
    Closes #1722 from mateiz/spark-2792 and squashes the following commits:
    
    5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last 
object written too
    18fe865 [Matei Zaharia] Update docs on objectStreamReset
    576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
    0374217 [Matei Zaharia] Remove super paranoid code to close file handles
    bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in 
ExternalSorter too
    0d6dad7 [Matei Zaharia] Added Mridul's test changes for 
ExternalAppendOnlyMap
    9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for 
batch sizes

commit 2225d18a751b7a4470a93f3d9edebe0d33df75c8
Author: Davies Liu <[email protected]>
Date:   2014-08-04T22:54:52Z

    [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple
    
    serializer is imported multiple times during doctests, so it's better to 
make _hijack_namedtuple() safe to be called multiple times.
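
    One common way to make such a patch safe to repeat is to mark the 
replacement and return early when the mark is already present. The sketch 
below shows that pattern on a throwaway namespace. The names install_wrapper 
and _already_wrapped are hypothetical and this is not the body of 
_hijack_namedtuple().

```python
import types


def install_wrapper(namespace, name, wrap):
    """Replace namespace.name with wrap(original), but only the first time.

    Returns True if the wrapper was installed and False if it was already
    there, so repeated calls never stack wrappers on top of each other.
    """
    original = getattr(namespace, name)
    if getattr(original, "_already_wrapped", False):
        return False
    wrapped = wrap(original)
    wrapped._already_wrapped = True
    setattr(namespace, name, wrapped)
    return True


def doubling(fn):
    """Example wrapper: doubles whatever the wrapped function returns."""
    return lambda x: 2 * fn(x)


ns = types.SimpleNamespace(f=lambda x: x + 1)
first = install_wrapper(ns, "f", doubling)
second = install_wrapper(ns, "f", doubling)
```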
    
    Author: Davies Liu <[email protected]>
    
    Closes #1771 from davies/fix and squashes the following commits:
    
    1a9e336 [Davies Liu] fix unit tests
    
    (cherry picked from commit 9fd82dbbcb8b10debbe95f1acab53ae8b340f38e)
    Signed-off-by: Josh Rosen <[email protected]>

commit 4ed7b5a2ff08eccf23d90990a4d7a2663efaf204
Author: Reynold Xin <[email protected]>
Date:   2014-08-05T03:39:18Z

    [SPARK-2323] Exception in accumulator update should not crash DAGScheduler 
& SparkContext
    
    Author: Reynold Xin <[email protected]>
    
    Closes #1772 from rxin/accumulator-dagscheduler and squashes the following 
commits:
    
    6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should 
not crash DAGScheduler & SparkContext.
    
    (cherry picked from commit 05bf4e4aff0d052a53d3e64c43688f07e27fec50)
    Signed-off-by: Reynold Xin <[email protected]>

commit a0922854909176a24cc689a7e8595303dcf62f3f
Author: Matei Zaharia <[email protected]>
Date:   2014-08-05T06:27:53Z

    SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()
    
    Replaces this with an O(1) operation that does not have to shift over
    the whole tail of the array into the gap produced by the element removed.
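
    A constant-time removal of this kind can be sketched by moving the last 
element into the gap and shrinking the buffer by one. The sketch is in Python 
rather than Scala, the name swap_remove is made up, and it assumes the order 
of the remaining elements does not matter.

```python
def swap_remove(buffer, index):
    """Remove and return buffer[index] in O(1) by swapping in the last item.

    Element order is not preserved. This also works when index points at
    the last element.
    """
    elem = buffer[index]
    buffer[index] = buffer[-1]
    buffer.pop()
    return elem


items = [10, 20, 30, 40]
removed = swap_remove(items, 1)
```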
    
    Author: Matei Zaharia <[email protected]>
    
    Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
    
    1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse 
ArrayBuffers
    eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid 
buffer.remove()
    
    (cherry picked from commit 066765d60d21b6b9943862b788e4a4bd07396e6c)
    Signed-off-by: Matei Zaharia <[email protected]>

commit d13d253fea6dd1f666c4c94087173f734843f2b5
Author: Matei Zaharia <[email protected]>
Date:   2014-08-05T06:41:03Z

    SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling 
collections
    
    This tracks memory properly if there are multiple spilling collections in 
the same task (which was a problem before), and also implements an algorithm 
that lets each thread grow up to 1 / (2N) of the memory pool (where N is the 
number of threads) before spilling, which avoids an inefficiency with small 
spills we had before (some threads would spill many times at 0-1 MB because the 
pool was allocated elsewhere).
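
    The bookkeeping can be pictured with a simplified, single-threaded 
sketch. It is not the ShuffleMemoryManager added here: the class and method 
names are hypothetical, there is no locking or waiting, and it only shows a 
shared pool that grants requests partially and reports the 1 / (2N) share 
mentioned above.

```python
class PoolSketch:
    """Toy tracker for memory granted to threads out of one shared pool."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = {}  # thread id -> bytes currently granted

    def min_share(self):
        """1 / (2N) of the pool, where N is the number of known threads."""
        n = max(len(self.used), 1)
        return self.max_bytes // (2 * n)

    def try_acquire(self, thread_id, num_bytes):
        """Grant up to num_bytes; the grant may be partial or zero."""
        self.used.setdefault(thread_id, 0)
        free = self.max_bytes - sum(self.used.values())
        granted = min(num_bytes, free)
        self.used[thread_id] += granted
        return granted

    def release_all(self, thread_id):
        """Return everything held by thread_id to the pool."""
        self.used.pop(thread_id, None)


pool = PoolSketch(100)
grant_a = pool.try_acquire("t1", 60)  # fits entirely
grant_b = pool.try_acquire("t2", 60)  # only 40 bytes are left
```

    A caller that receives less than it asked for would then decide whether 
to spill, which is the situation the 1 / (2N) rule is meant to make rarer.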
    
    Author: Matei Zaharia <[email protected]>
    
    Closes #1707 from mateiz/spark-2711 and squashes the following commits:
    
    debf75b [Matei Zaharia] Review comments
    24f28f3 [Matei Zaharia] Small rename
    c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially 
grant requests
    315e3a5 [Matei Zaharia] Some review comments
    b810120 [Matei Zaharia] Create central manager to track memory for all 
spilling collections
    
    (cherry picked from commit 4fde28c2063f673ec7f51d514ba62a73321960a1)
    Signed-off-by: Matei Zaharia <[email protected]>

----

