GitHub user huozhanfeng opened a pull request:
https://github.com/apache/spark/pull/1824
Branch 1.1
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-1.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1824.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1824
----
commit e22110879cd149e94c9a5ca7466f787033572b15
Author: Andrew Or <[email protected]>
Date: 2014-08-02T19:11:50Z
[HOTFIX] Do not throw NPE if spark.test.home is not set
`spark.test.home` was introduced in #1734. This is fine for SBT but is
failing maven tests. Either way it shouldn't throw an NPE.
Author: Andrew Or <[email protected]>
Closes #1739 from andrewor14/fix-spark-test-home and squashes the following
commits:
ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
commit 8d6ac2b95ab48d9fffe82ef04cef3b22c2c139e0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-02T20:07:17Z
[SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.
API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and
LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()
Examples and testing:
* Added example testing classification and regression with batch
prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py
which tests single-example prediction with dense and sparse vectors
Also: Small bug fix in python/pyspark/mllib/_common.py: In
_linear_predictor_typecheck, changed check for RDD to use isinstance() instead
of type() in order to catch RDD subclasses.
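A minimal sketch of why the isinstance() check matters. The `RDD` and `SchemaRDD` classes below are hypothetical stand-ins defined only for this example, not the real pyspark classes:

```python
class RDD(object):
    """Hypothetical stand-in for an RDD base class."""


class SchemaRDD(RDD):
    """Hypothetical RDD subclass."""


def is_rdd_by_type(x):
    # Exact type comparison: a subclass instance is rejected.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() also accepts instances of subclasses.
    return isinstance(x, RDD)


sub = SchemaRDD()
by_type = is_rdd_by_type(sub)
by_isinstance = is_rdd_by_isinstance(sub)
```

Here `by_type` is False while `by_isinstance` is True, which is the case the fix is meant to catch.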
CC mengxr manishamde
Author: Joseph K. Bradley <[email protected]>
Closes #1727 from jkbradley/decisiontree-python-new and squashes the
following commits:
3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py
Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in
pyspark/mllib/util.py caused by change in loadLibSVMFile behavior. (It used to
threshold labels at 0 to make them 0/1, but it now leaves them as they are.) *
Fixed small bug in loadLibSVMFile: If a data file had no features, then
loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD)
returns an empty RDD instead of an error. * Removed support for calling
predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache
serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into
decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is
the bound for unordered categorical features, not ordered ones. The upper bound
should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example
but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural
syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit
scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1
* In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check
numClassesForClassification only if algo=Classification. * Updates based on
comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm **
Small cleanups ** tree.Node: Made recursive helper methods private, and renamed
them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump
with 2 continuous variables for binary classification. Caused problems in
past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires
numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for
splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into
decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix:
Updated DecisionTreeSuite so that 3 tests fail. Will describe bug in next
commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.
Also added toString, depth, and numNodes methods to DecisionTreeModel.
(cherry picked from commit 3f67382e7c9c3f6a8f6ce124ab3fcb1a9c1a264f)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 91de0dc1654d609dc1ff8fa9a07ba18043ad61c6
Author: Yin Huai <[email protected]>
Date: 2014-08-02T20:16:41Z
[SQL] Set outputPartitioning of BroadcastHashJoin correctly.
I think we will not generate the plan triggering this bug at this moment.
But, let me explain it...
Right now, we are using `left.outputPartitioning` as the
`outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical
plan for cases like...
```sql
SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
FROM src
GROUP BY key) l -- This is buildPlan
JOIN r -- This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key
```
Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we
will pick `l`'s `outputPartitioning` for the `outputPartitioning` of the
`BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using
`l.key` as the key, we will not introduce an `Exchange` for this aggregation.
However, `r`'s outputPartitioning may not match the required distribution of
the last `GROUP BY` and we fail to group data correctly.
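The problem can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: partitions are lists of dicts, `broadcast_hash_join` is a hypothetical helper rather than Spark's operator, and the data is made up.

```python
def broadcast_hash_join(streamed_partitions, build_rows, stream_key, build_key):
    """Join each streamed partition against a broadcast copy of the build side."""
    # Hash the (small) build side once; every partition sees the same table.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Output partition i is derived only from streamed partition i.
    return [
        [(b, s) for s in part for b in table.get(s[stream_key], [])]
        for part in streamed_partitions
    ]


# l (build side) is already grouped by key; r (streamed side) is partitioned arbitrarily.
l_rows = [{"key": "a", "cnt": 1}, {"key": "b", "cnt": 2}]
r_partitions = [[{"value": 1}], [{"value": 1}, {"value": 2}]]

joined = broadcast_hash_join(r_partitions, l_rows, "value", "cnt")
keys_per_partition = [sorted({b["key"] for b, _ in part}) for part in joined]
```

In this toy run the key "a" ends up in both output partitions, so the output follows the partitioning of `r`, and a final `GROUP BY l.key` without an `Exchange` would see two partial groups for "a".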
JIRA is being reindexed. I will create a JIRA ticket once it is back online.
Author: Yin Huai <[email protected]>
Closes #1735 from yhuai/BroadcastHashJoin and squashes the following
commits:
96d9cb3 [Yin Huai] Set outputPartitioning correctly.
(cherry picked from commit 67bd8e3c217a80c3117a6e3853aa60fe13d08c91)
Signed-off-by: Michael Armbrust <[email protected]>
commit bb0ac6d7c91c491a99c252e6cb4aea40efe9b190
Author: Chris Fregly <[email protected]>
Date: 2014-08-02T20:35:35Z
[SPARK-1981] Add AWS Kinesis streaming support
Author: Chris Fregly <[email protected]>
Closes #1434 from cfregly/master and squashes the following commits:
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be
more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back
into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with
JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples,
updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples
into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the
KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a
publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and
kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
(cherry picked from commit 91f9504e6086fac05b40545099f9818949c24bca)
Signed-off-by: Tathagata Das <[email protected]>
commit 7924d72cf8aae945d72f355c54c4fcb3d62e6c48
Author: GuoQiang Li <[email protected]>
Date: 2014-08-02T20:55:28Z
SPARK-2804: Remove scalalogging-slf4j dependency
This also Closes #1701.
Author: GuoQiang Li <[email protected]>
Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
commit 3b9f25f4259b254f3faa2a7d61e547089a69c259
Author: Michael Armbrust <[email protected]>
Date: 2014-08-02T23:33:48Z
[SPARK-2097][SQL] UDF Support
This patch adds the ability to register lambda functions written in Python,
Java or Scala as UDFs for use in SQL or HiveQL.
Scala:
```scala
registerFunction("strLenScala", (_: String).length)
sql("SELECT strLenScala('test')")
```
Python:
```python
sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
sqlCtx.sql("SELECT strLenPython('test')")
```
Java:
```java
sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>()
{
@Override
public Integer call(String str) throws Exception {
return str.length();
}
}, DataType.IntegerType);
sqlContext.sql("SELECT stringLengthJava('test')");
```
Author: Michael Armbrust <[email protected]>
Closes #1063 from marmbrus/udfs and squashes the following commits:
9eda0fe [Michael Armbrust] newline
747c05e [Michael Armbrust] Add some scala UDF tests.
d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master'
into udfs
005d684 [Michael Armbrust] Fix naming and formatting.
d14dac8 [Michael Armbrust] Fix last line of autogened java files.
8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master'
into udfs
6a36890 [Michael Armbrust] Switch logging so that SQLContext can be
serializable.
7a83101 [Michael Armbrust] Drop toString
795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
e54fb45 [Michael Armbrust] Docs and tests.
437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests,
address review comments.
01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into udfs
8e6c932 [Michael Armbrust] WIP
3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master'
into udfs
6237c8d [Michael Armbrust] WIP
2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support
for Java UDFs.
0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and
Python.
(cherry picked from commit 158ad0bba9382fd494b4789b5628a9cec00cfa19)
Signed-off-by: Michael Armbrust <[email protected]>
commit 4230df4e1d6c59dc3405f46f5edf18c3825a5447
Author: Michael Armbrust <[email protected]>
Date: 2014-08-02T23:48:07Z
[SPARK-2785][SQL] Remove assertions that throw when users try unsupported
Hive commands.
Author: Michael Armbrust <[email protected]>
Closes #1742 from marmbrus/asserts and squashes the following commits:
5182d54 [Michael Armbrust] Remove assertions that throw when users try
unsupported Hive commands.
(cherry picked from commit 198df11f1a9f419f820f47eba0e9f2ab371a824b)
Signed-off-by: Michael Armbrust <[email protected]>
commit 460fad817da1fb6619d2456f637c1b7c7f5e8c7c
Author: Cheng Lian <[email protected]>
Date: 2014-08-03T00:12:49Z
[SPARK-2729][SQL] Added test case for SPARK-2729
This is a follow up of #1636.
Author: Cheng Lian <[email protected]>
Closes #1738 from liancheng/test-for-spark-2729 and squashes the following
commits:
b13692a [Cheng Lian] Added test case for SPARK-2729
(cherry picked from commit 866cf1f822cfda22294054be026ef2d96307eb75)
Signed-off-by: Michael Armbrust <[email protected]>
commit 5ef828273deb4713a49700c56d51bdd980917cfd
Author: Yin Huai <[email protected]>
Date: 2014-08-03T00:55:22Z
[SPARK-2797] [SQL] SchemaRDDs don't support unpersist()
The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
Author: Yin Huai <[email protected]>
Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called
without the input parameter (blocking) from PySpark.
(cherry picked from commit d210022e96804e59e42ab902e53637e50884a9ab)
Signed-off-by: Michael Armbrust <[email protected]>
commit 5b30e001839a29e6c4bd1fc24bfa12d9166ef10c
Author: Michael Armbrust <[email protected]>
Date: 2014-08-03T01:27:04Z
[SPARK-2739][SQL] Rename registerAsTable to registerTempTable
There have been user complaints that the difference between
`registerAsTable` and `saveAsTable` is too subtle. This PR addresses this by
renaming `registerAsTable` to `registerTempTable`, which more clearly reflects
what is happening. `registerAsTable` remains, but will cause a deprecation
warning.
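A rough sketch of the rename-with-deprecation pattern, using Python's standard `warnings` module. The `TableRegistry` class is hypothetical and only mimics the two method names; it is not the Spark implementation:

```python
import warnings


class TableRegistry(object):
    """Hypothetical registry that keeps the old method name as a deprecated alias."""

    def __init__(self):
        self.tables = {}

    def registerTempTable(self, name, data):
        self.tables[name] = data

    def registerAsTable(self, name, data):
        # The old name still works but tells callers to migrate.
        warnings.warn(
            "registerAsTable is deprecated, use registerTempTable instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.registerTempTable(name, data)
```

Existing callers of the old name keep working, and the warning points them at the new one.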
Author: Michael Armbrust <[email protected]>
Closes #1743 from marmbrus/registerTempTable and squashes the following
commits:
d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master'
into registerTempTable
4dff086 [Michael Armbrust] Fix .java files too
89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master'
into registerTempTable
0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
(cherry picked from commit 1a8043739dc1d9435def6ea3c6341498ba52b708)
Signed-off-by: Michael Armbrust <[email protected]>
commit 0d47bb642f645c3c8663f4bdf869b5337ef9cb35
Author: Sean Owen <[email protected]>
Date: 2014-08-03T04:44:19Z
SPARK-2602 [BUILD] Tests steal focus under Java 6
As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be
resolved for Java 6 with the java.awt.headless system property, which never
hurt anyone running a command line app. I tested it and it seemed to get rid of
focus stealing.
Author: Sean Owen <[email protected]>
Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
b141018 [Sean Owen] Set java.awt.headless during tests
(cherry picked from commit 33f167d762483b55d5d874dcc1e3075f661d4375)
Signed-off-by: Patrick Wendell <[email protected]>
commit c137928cbe74446254fdbd656c50c1a1c8930094
Author: Sean Owen <[email protected]>
Date: 2014-08-03T04:55:56Z
SPARK-2414 [BUILD] Add LICENSE entry for jquery
The JIRA concerned removing jquery, and this does not remove jquery. While
it is distributed by Spark it should have an accompanying line in LICENSE, very
technically, as per http://www.apache.org/dev/licensing-howto.html
Author: Sean Owen <[email protected]>
Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
2fdb03c [Sean Owen] Add LICENSE entry for jquery
(cherry picked from commit 9cf429aaf529e91f619910c33cfe46bf33a66982)
Signed-off-by: Patrick Wendell <[email protected]>
commit fb2a2079fa10ea8f338d68945a94238dda9fbd66
Author: Andrew Or <[email protected]>
Date: 2014-08-03T05:00:46Z
[Minor] Fixes on top of #1679
Minor fixes on top of #1679.
Author: Andrew Or <[email protected]>
Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
3b46f5e [Andrew Or] Minor fixes
(cherry picked from commit 3dc55fdf450b4237f7c592fce56d1467fd206366)
Signed-off-by: Patrick Wendell <[email protected]>
commit 1992175fd93f0239e5a09e0b8db99ad9af7f380c
Author: Stephen Boesch <[email protected]>
Date: 2014-08-03T17:19:04Z
SPARK-2712 - Add a small note to maven doc that mvn package must happen ...
Per request by Reynold adding small note about proper sequencing of build
then test.
Author: Stephen Boesch <[email protected]>
Closes #1615 from javadba/docs and squashes the following commits:
6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that
mvn package must happen before test
(cherry picked from commit f8cd143b6b1b4d8aac87c229e5af263b0319b3ea)
Signed-off-by: Patrick Wendell <[email protected]>
commit 162fc9512018e0c592b3aaa29d405f511461795a
Author: Allan Douglas R. de Oliveira <[email protected]>
Date: 2014-08-03T17:25:59Z
SPARK-2246: Add user-data option to EC2 scripts
Author: Allan Douglas R. de Oliveira <[email protected]>
Closes #1186 from douglaz/spark_ec2_user_data and squashes the following
commits:
94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
(cherry picked from commit a0bcbc159e89be868ccc96175dbf1439461557e1)
Signed-off-by: Patrick Wendell <[email protected]>
commit eaa93555a7f935b00a2f94a7fa50a12e11578bd7
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-03T17:36:52Z
[SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
Bug fix: Before, when an RDD was created in Java and passed to
DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from
Java.
Other improvements to Decision Trees for ease-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
--> Note: I removed quantileCalculationStrategy from the Java-friendly
constructor since (a) it is a special class and (b) there is only 1 option
currently. I suspect we will redo the API before the other options are
included.
CC: mengxr
Author: Joseph K. Bradley <[email protected]>
Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead
of JavaConversions
519b1b7 [Joseph K. Bradley] * Organized imports in
JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in
DecisionTreeSuite.scala
f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run
DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow
passing RDDs from Java. * impurity classes: Added instance() methods to help
with Java interface. * Strategy: Added Java-friendly constructor ** Note: I
removed quantileCalculationStrategy from the Java-friendly constructor since
(a) it is a special class and (b) there is only 1 option currently. I suspect
we will redo the API before the other options are included.
d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-java
320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master'
into dt-java
f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated
later
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is
the bound for unordered categorical features, not ordered ones. The upper bound
should be the arity (i.e., max value) of the feature.
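For a sense of how far apart the two bounds in commit 225822f can be, this small sketch evaluates both formulas from the message. The function names are made up for illustration:

```python
def unordered_bound(arity):
    # Bound quoted in the commit message for unordered categorical features.
    return 2 ** (arity - 1) - 1


def ordered_bound(arity):
    # Bound for ordered categorical features: the arity itself.
    return arity


# (arity, unordered bound, ordered bound)
bounds = [(k, unordered_bound(k), ordered_bound(k)) for k in (2, 3, 5)]
```

The two formulas happen to agree at arity 3, are too small at arity 2, and are too large from arity 4 onwards.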
(cherry picked from commit 2998e38a942351974da36cb619e863c6f0316e7a)
Signed-off-by: Xiangrui Meng <[email protected]>
commit c5ed1deba6b3f3e597554a8d0f93f402ae62fab9
Author: Michael Armbrust <[email protected]>
Date: 2014-08-03T19:28:29Z
[SPARK-2784][SQL] Deprecate hql() method in favor of a config option,
'spark.sql.dialect'
Many users have reported being confused by the distinction between the
`sql` and `hql` methods. Specifically, many users think that `sql(...)` cannot
be used to read hive tables. In this PR I introduce a new configuration option
`spark.sql.dialect` that picks which dialect will be used for parsing. For
SQLContext this must be set to `sql`. In `HiveContext` it defaults to `hiveql`
but can also be set to `sql`.
The `hql` and `hiveql` methods continue to act the same but are now marked
as deprecated.
**This is a possibly breaking change for some users unless they set the
dialect manually, though this is unlikely.**
For example: `hiveContext.sql("SELECT 1")` will now throw a parsing
exception by default.
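The dialect selection described above could look roughly like the following. This is a sketch only: the two classes and `set_conf` are hypothetical stand-ins, and `sql()` just reports which parser would be chosen instead of parsing anything.

```python
class SQLContextSketch(object):
    """Hypothetical context that only understands the 'sql' dialect."""

    default_dialect = "sql"
    supported_dialects = ("sql",)

    def __init__(self):
        self.conf = {}

    def set_conf(self, key, value):
        self.conf[key] = value

    def dialect(self):
        return self.conf.get("spark.sql.dialect", self.default_dialect)

    def sql(self, text):
        dialect = self.dialect()
        if dialect not in self.supported_dialects:
            raise ValueError("Unsupported SQL dialect: %s" % dialect)
        # A real context would hand `text` to the matching parser here.
        return dialect


class HiveContextSketch(SQLContextSketch):
    """Hypothetical context that defaults to 'hiveql' but also accepts 'sql'."""

    default_dialect = "hiveql"
    supported_dialects = ("sql", "hiveql")
```

With this shape, a single `sql()` entry point serves both dialects and the configuration key decides which parser sees the query text.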
Author: Michael Armbrust <[email protected]>
Closes #1746 from marmbrus/sqlLanguageConf and squashes the following
commits:
ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master'
into sqlLanguageConf
20c43f8 [Michael Armbrust] override function instead of just setting the
value
7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config
option, 'spark.sql.dialect'
(cherry picked from commit 236dfac6769016e433b2f6517cda2d308dea74bc)
Signed-off-by: Michael Armbrust <[email protected]>
commit 6ffdcc61fb4825f991b754c45b807192f483a4a3
Author: Cheng Lian <[email protected]>
Date: 2014-08-03T19:34:46Z
[SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native
commands
JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
Author: Cheng Lian <[email protected]>
Closes #1753 from liancheng/spark-2814 and squashes the following commits:
c74a3b2 [Cheng Lian] Fixed SPARK-2814
(cherry picked from commit ac33cbbf33bd1ab29bc8165c9be02fb8934b1fdf)
Signed-off-by: Michael Armbrust <[email protected]>
commit 7c6afdac867d52447221438ed7508123c07d17f8
Author: Yin Huai <[email protected]>
Date: 2014-08-03T21:54:41Z
[SPARK-2783][SQL] Basic support for analyze in HiveContext
JIRA: https://issues.apache.org/jira/browse/SPARK-2783
Author: Yin Huai <[email protected]>
Closes #1741 from yhuai/analyzeTable and squashes the following commits:
7bb5f02 [Yin Huai] Use sql instead of hql.
4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
analyzeTable
e3ebcd4 [Yin Huai] Renaming.
c170f4e [Yin Huai] Do not use getContentSummary.
62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
analyzeTable
db233a6 [Yin Huai] Trying to debug jenkins...
fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
analyzeTable
f0501f3 [Yin Huai] Fix compilation error.
24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into
analyzeTable
8918140 [Yin Huai] Wording.
23df227 [Yin Huai] Add a simple analyze method to get the size of a table
and update the "totalSize" property of this table in the Hive metastore.
(cherry picked from commit e139e2be60ef23281327744e1b3e74904dfdf63f)
Signed-off-by: Michael Armbrust <[email protected]>
commit a4cdb77e5ee2c80967a7b6cd7370170fabe56cd2
Author: Davies Liu <[email protected]>
Date: 2014-08-03T22:52:00Z
[SPARK-1740] [PySpark] kill the python worker
Kill only the python worker related to cancelled tasks.
The daemon will start a background thread to monitor all the opened sockets
for all workers. If the socket is closed by JVM, this thread will kill the
worker.
When a task is cancelled, the socket to the worker will be closed, then the
worker will be killed by the daemon.
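A simplified sketch of the detection step, assuming Python 3. It does a single polling pass instead of running a background thread, the pids are made up, and `kill` is a callback rather than a real signal, so it only illustrates how a closed socket can be noticed:

```python
import socket


def peer_closed(sock):
    """Return True if the other end of `sock` has been closed."""
    sock.setblocking(False)
    try:
        # An empty read on a stream socket signals end-of-file.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except BlockingIOError:
        return False  # Still open, just nothing to read yet.


def reap_cancelled(workers, kill):
    """Call kill(pid) for every worker whose socket was closed by the peer."""
    for pid, sock in list(workers.items()):
        if peer_closed(sock):
            kill(pid)
            sock.close()
            del workers[pid]


# Hypothetical setup: two "workers", identified by made-up pids.
jvm_a, worker_a = socket.socketpair()
jvm_b, worker_b = socket.socketpair()
workers = {101: worker_a, 102: worker_b}

killed = []
jvm_a.close()  # The JVM side closes the socket of the cancelled task.
reap_cancelled(workers, killed.append)
```

Only the worker whose socket was closed is reported, which matches the goal of killing just the worker related to the cancelled task.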
Author: Davies Liu <[email protected]>
Closes #1643 from davies/kill and squashes the following commits:
8ffe9f3 [Davies Liu] kill worker by daemon, because runtime.exec() is too
heavy
46ca150 [Davies Liu] address comment
acd751c [Davies Liu] kill the worker when task is canceled
(cherry picked from commit 55349f9fe81ba5af5e4a5e4908ebf174e63c6cc9)
Signed-off-by: Josh Rosen <[email protected]>
commit 4784d24eadea2e1adf69d8fe4891bdce29188dd6
Author: Anand Avati <[email protected]>
Date: 2014-08-04T00:47:49Z
[SPARK-2810] upgrade to scala-maven-plugin 3.2.0
Needed for Scala 2.11 compiler-interface
Signed-off-by: Anand Avati <avatiredhat.com>
Author: Anand Avati <[email protected]>
Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the
following commits:
9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0
commit 2152e24d64d6a07cf6c550c9f13ab0231596be98
Author: Sarah Gerweck <[email protected]>
Date: 2014-08-04T02:47:05Z
Fix some bugs with spaces in directory name.
Any time you use the directory name (`FWDIR`) it needs to be surrounded
in quotes. If you're also using wildcards, you can safely put the quotes
around just `$FWDIR`.
Author: Sarah Gerweck <[email protected]>
Closes #1756 from sarahgerweck/folderSpaces and squashes the following
commits:
732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
(cherry picked from commit 5507dd8e18fbb52d5e0c64a767103b2418cb09c6)
Signed-off-by: Patrick Wendell <[email protected]>
commit 9aa14598f89bb8b908222e37f965178d39c34fe6
Author: DB Tsai <[email protected]>
Date: 2014-08-04T04:39:21Z
SPARK-2272 [MLlib] Feature scaling which standardizes the range of
independent variables or features of data
Feature scaling is a method used to standardize the range of independent
variables or features of data. In data processing, it is generally performed
during the data preprocessing step.
In this work, a trait called `VectorTransformer` is defined for generic
transformation on a vector. It contains one method to be implemented,
`transform` which applies transformation on a vector.
There are two implementations of `VectorTransformer` now, and both can
be easily extended with PMML transformation support.
1) `StandardScaler` - Standardizes features by removing the mean and
scaling to unit variance using column summary statistics on the samples in the
training set.
2) `Normalizer` - Normalizes samples individually to unit L^n norm
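The two transformations can be sketched in plain Python as follows. This is not the MLlib code: the function names are hypothetical, the unbiased (n - 1) variance is an assumption, and the norm order is called `p` here.

```python
import math


def standard_scale(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Assumption: unbiased (n - 1) variance; constant columns map to 0.0.
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for r in rows
    ]


def normalize(vector, p=2.0):
    """Scale one sample to unit L^p norm; zero vectors are returned unchanged."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector] if norm > 0 else list(vector)


scaled = standard_scale([[1.0, 10.0], [3.0, 10.0]])
unit = normalize([3.0, 4.0])
```

Note the difference in scope: `standard_scale` needs statistics over the whole training set, while `normalize` works on each sample independently.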
Author: DB Tsai <[email protected]>
Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following
commits:
78c15d3 [DB Tsai] Alpine Data Labs
(cherry picked from commit ae58aea2d1435b5bb011e68127e1bcddc2edf5b2)
Signed-off-by: Xiangrui Meng <[email protected]>
commit 3823f6d25e2a89ca1bfa62a76f6e708c2c63f064
Author: Liquan Pei <[email protected]>
Date: 2014-08-04T06:55:58Z
[MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
This is a pull request regarding SPARK-2510 at
https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector
representation of words in a text corpus. The algorithm first constructs a
vocabulary from the corpus and then learns vector representation of words in
the vocabulary. The vector representation can be used as features in natural
language processing and machine learning algorithms.
To make our implementation more scalable, we train each partition
separately and merge the model of each partition after each iteration. To make
the model more accurate, multiple iterations may be needed.
One way to investigate the vector representations is to find the closest words
for a query word. For example, with 1 partition and 1 iteration, the top 20
closest words to "china" are:
taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876
The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932
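Closest-word lists like the ones above can be produced by ranking all other words by similarity to the query vector. The sketch below assumes cosine similarity, which the description does not state explicitly, and uses made-up two-dimensional vectors and a hypothetical `find_closest` helper:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_closest(vectors, word, num):
    """Return the `num` words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Toy vectors, invented for this example only.
toy_vectors = {
    "china": [1.0, 0.0],
    "taiwan": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
closest = find_closest(toy_vectors, "china", 2)
```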
Author: Liquan Pei <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Liquan Pei <[email protected]>
Closes #1719 from Ishiihara/master and squashes the following commits:
2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change
model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit
(cherry picked from commit e053c55819363fab7068bb9165e3379f0c2f570c)
Signed-off-by: Xiangrui Meng <[email protected]>
commit bfd2f39581d958d5aafaa76994f44213bcdfbb69
Author: Davies Liu <[email protected]>
Date: 2014-08-04T19:13:41Z
[SPARK-1687] [PySpark] pickable namedtuple
Add a hook to replace the original namedtuple with a picklable one, so that
namedtuple can be used in RDDs.
PS: pyspark should be imported BEFORE "from collections import namedtuple"
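One way to make a namedtuple picklable by value is to attach a `__reduce__` that rebuilds the class from its name and fields. The sketch below illustrates that idea and is not the code from this commit; `_restore` and `make_picklable` are names chosen for the example:

```python
import collections
import pickle


def _restore(name, fields, values):
    """Rebuild the namedtuple class from its name and fields, then the instance."""
    cls = collections.namedtuple(name, fields)
    return cls(*values)


def make_picklable(cls):
    """Attach a __reduce__ that pickles instances by value, not by class path."""
    def __reduce__(self):
        return (_restore, (cls.__name__, cls._fields, tuple(self)))
    cls.__reduce__ = __reduce__
    return cls


Point = make_picklable(collections.namedtuple("Point", "x y"))
restored = pickle.loads(pickle.dumps(Point(1, 2)))
```

The pickled data no longer refers to `Point` by module path, so the receiving side does not need the class definition. In this sketch `_restore` itself still has to be importable wherever the data is unpickled.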
Author: Davies Liu <[email protected]>
Closes #1623 from davies/namedtuple and squashes the following commits:
045dad8 [Davies Liu] remove unrelated code changes
4132f32 [Davies Liu] address comment
55b1c1a [Davies Liu] fix tests
61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked
one
98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
namedtuple
f7b1bde [Davies Liu] add hack for CloudPickleSerializer
0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
namedtuple
21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
93b03b8 [Davies Liu] pickable namedtuple
(cherry picked from commit 59f84a9531f7974a053fd4963ce9afd88273ea4c)
Signed-off-by: Josh Rosen <[email protected]>
commit aa7a48ee905b95e57f64051ea887d4775b427603
Author: Matei Zaharia <[email protected]>
Date: 2014-08-04T19:59:18Z
SPARK-2792. Fix reading too much or too little data from each stream in
ExternalMap / Sorter
All these changes are from mridulm's work in #1609, but extracted here to
fix this specific issue and make it easier to merge into 1.1. This particular
set of changes is to make sure that we read exactly the right range of bytes
from each spill file in EAOM: some serializers can write bytes after the last
object (e.g. the TC_RESET flag in Java serialization) and that would confuse
the previous code into reading it as part of the next batch. There are also
improvements to cleanup to make sure files are closed.
In addition to bringing in the changes to ExternalAppendOnlyMap, I also
copied them to the corresponding code in ExternalSorter and updated its test
suite to test for the same issues.
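The byte-range idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Scala code in ExternalAppendOnlyMap: `write_batches`, `read_batches`, and the `TRAILER` bytes are hypothetical names, and `pickle` stands in for the serializer. The writer records how many bytes each batch occupies, and the reader slices exactly that range, so bytes written after the last object never leak into the next batch.

```python
import io
import pickle

# Hypothetical stand-in for bytes a serializer may emit after the last object.
TRAILER = b"\xff\xff"


def write_batches(batches):
    """Serialize batches back to back; return the bytes and (size, count) per batch."""
    buf = io.BytesIO()
    index = []
    for batch in batches:
        start = buf.tell()
        for obj in batch:
            pickle.dump(obj, buf)
        buf.write(TRAILER)  # trailing bytes that are not part of any object
        index.append((buf.tell() - start, len(batch)))
    return buf.getvalue(), index


def read_batches(data, index):
    """Yield each batch, reading only the byte range recorded for it."""
    offset = 0
    for size, count in index:
        chunk = io.BytesIO(data[offset:offset + size])
        offset += size  # skip to the next batch, past any trailing bytes
        yield [pickle.load(chunk) for _ in range(count)]
```

Because the reader advances by the recorded size rather than by how many bytes the deserializer happened to consume, the trailing bytes are dropped with the batch they belong to.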
Author: Matei Zaharia <[email protected]>
Closes #1722 from mateiz/spark-2792 and squashes the following commits:
5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last
object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in
ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for
ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for
batch sizes
commit 2225d18a751b7a4470a93f3d9edebe0d33df75c8
Author: Davies Liu <[email protected]>
Date: 2014-08-04T22:54:52Z
[SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple
The serializer module is imported multiple times during doctests, so it is
better to make _hijack_namedtuple() safe to call multiple times.
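As a rough sketch of what "safe to be called multiple times" can look like, the snippet below guards a monkey patch with a marker attribute. It is not the PySpark implementation: `hijack_namedtuple`, the `_hijacked` and `_patched` attributes, and the stand-in namespace are all assumptions made for illustration, and the wrapper only tags the class instead of making it picklable.

```python
import collections
import types


def hijack_namedtuple(module):
    """Replace module.namedtuple with a wrapper, at most once per module."""
    original = module.namedtuple
    if getattr(original, "_hijacked", False):
        return  # already patched; a second call must not wrap the wrapper

    def namedtuple(*args, **kwargs):
        cls = original(*args, **kwargs)
        cls._patched = True  # placeholder for the real "make picklable" step
        return cls

    namedtuple._hijacked = True
    module.namedtuple = namedtuple


# A stand-in namespace is patched here so the real collections module is untouched.
fake_collections = types.SimpleNamespace(namedtuple=collections.namedtuple)
hijack_namedtuple(fake_collections)
first_wrapper = fake_collections.namedtuple
hijack_namedtuple(fake_collections)  # second call is a no-op
```

Without the guard, each extra call would wrap the previous wrapper again, so every new class would pass through one more layer.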
Author: Davies Liu <[email protected]>
Closes #1771 from davies/fix and squashes the following commits:
1a9e336 [Davies Liu] fix unit tests
(cherry picked from commit 9fd82dbbcb8b10debbe95f1acab53ae8b340f38e)
Signed-off-by: Josh Rosen <[email protected]>
commit 4ed7b5a2ff08eccf23d90990a4d7a2663efaf204
Author: Reynold Xin <[email protected]>
Date: 2014-08-05T03:39:18Z
[SPARK-2323] Exception in accumulator update should not crash DAGScheduler
& SparkContext
Author: Reynold Xin <[email protected]>
Closes #1772 from rxin/accumulator-dagscheduler and squashes the following
commits:
6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should
not crash DAGScheduler & SparkContext.
(cherry picked from commit 05bf4e4aff0d052a53d3e64c43688f07e27fec50)
Signed-off-by: Reynold Xin <[email protected]>
commit a0922854909176a24cc689a7e8595303dcf62f3f
Author: Matei Zaharia <[email protected]>
Date: 2014-08-05T06:27:53Z
SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()
Replaces this with an O(1) operation that does not have to shift over
the whole tail of the array into the gap produced by the element removed.
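One common way to get a constant-time removal from an array-backed buffer is to move the last element into the vacated slot. The Python sketch below shows that technique under the assumption that element order does not matter; `remove_unordered` is a hypothetical helper name, and the commit message does not spell out the exact operation used.

```python
def remove_unordered(buf, index):
    """Remove and return buf[index] in O(1); element order is not preserved."""
    if not 0 <= index < len(buf):
        raise IndexError("index out of range")
    last = buf.pop()  # popping the tail never shifts other elements
    if index == len(buf):
        return last  # the requested element was the tail itself
    removed = buf[index]
    buf[index] = last  # fill the gap with the former tail
    return removed
```

For example, removing index 1 from `[10, 20, 30, 40]` returns 20 and leaves `[10, 40, 30]`.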
Author: Matei Zaharia <[email protected]>
Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse
ArrayBuffers
eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid
buffer.remove()
(cherry picked from commit 066765d60d21b6b9943862b788e4a4bd07396e6c)
Signed-off-by: Matei Zaharia <[email protected]>
commit d13d253fea6dd1f666c4c94087173f734843f2b5
Author: Matei Zaharia <[email protected]>
Date: 2014-08-05T06:41:03Z
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling
collections
This tracks memory properly if there are multiple spilling collections in
the same task (which was a problem before), and also implements an algorithm
that lets each thread grow up to 1/(2N) of the memory pool (where N is the
number of threads) before spilling, which avoids an inefficiency with small
spills we had before (some threads would spill many times at 0-1 MB because the
pool was allocated elsewhere).
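The policy described above can be illustrated with a small single-threaded Python sketch. It is a simplification under stated assumptions, not the ShuffleMemoryManager code: `MemoryPoolSketch` and its methods are hypothetical names, there is no blocking or locking, and no upper cap per thread is modeled. Requests may be granted partially, and a thread is told to spill only once it already holds at least 1/(2N) of the pool.

```python
class MemoryPoolSketch:
    """Toy model of a shared memory pool with a 1/(2N) share per thread."""

    def __init__(self, max_memory):
        self.max_memory = max_memory
        self.used = {}  # thread id -> bytes currently held

    def guaranteed_share(self):
        """Bytes a thread may reach before it is asked to spill."""
        n = max(len(self.used), 1)
        return self.max_memory // (2 * n)

    def try_acquire(self, thread_id, num_bytes):
        """Grant up to num_bytes (possibly fewer, possibly 0) to thread_id."""
        self.used.setdefault(thread_id, 0)
        free = self.max_memory - sum(self.used.values())
        granted = max(0, min(num_bytes, free))
        self.used[thread_id] += granted
        return granted

    def should_spill(self, thread_id, granted, requested):
        """Spill only if the request fell short and the share is already reached."""
        short = granted < requested
        return short and self.used[thread_id] >= self.guaranteed_share()

    def release(self, thread_id, num_bytes):
        """Return memory to the pool, for example after a spill."""
        self.used[thread_id] = max(0, self.used.get(thread_id, 0) - num_bytes)
```

In this model a thread that is short on memory but still below its share would wait for others to release memory rather than spill, which is how the many tiny spills mentioned above are avoided.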
Author: Matei Zaharia <[email protected]>
Closes #1707 from mateiz/spark-2711 and squashes the following commits:
debf75b [Matei Zaharia] Review comments
24f28f3 [Matei Zaharia] Small rename
c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially
grant requests
315e3a5 [Matei Zaharia] Some review comments
b810120 [Matei Zaharia] Create central manager to track memory for all
spilling collections
(cherry picked from commit 4fde28c2063f673ec7f51d514ba62a73321960a1)
Signed-off-by: Matei Zaharia <[email protected]>
----