GitHub user tengpeng opened a pull request:
https://github.com/apache/spark/pull/20729
[SPARK-23578][ML]Add multicolumn support for Binarizer
[SPARK-20542] added an API to Bucketizer that can bin multiple columns.
Based on that change, this PR adds multi-column support to Binarizer.
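For reference, a minimal sketch of how the multi-column usage could look, assuming the
new API mirrors Bucketizer's setInputCols/setOutputCols and adds per-column thresholds
(the method names below are illustrative, not confirmed by this PR):
```scala
import org.apache.spark.ml.feature.Binarizer

// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq((0.1, 5.0), (0.8, 1.0)).toDF("f1", "f2")

// Hypothetical multi-column API, mirroring Bucketizer's multi-column support.
val binarizer = new Binarizer()
  .setInputCols(Array("f1", "f2"))          // assumed multi-column setter
  .setOutputCols(Array("f1_bin", "f2_bin"))
  .setThresholds(Array(0.5, 2.0))           // assumed per-column thresholds

binarizer.transform(df).show()
```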
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tengpeng/spark Binarizer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20729.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20729
----
commit 9ca0f6eaf6744c090cab4ac6720cf11c9b83915e
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:32:36Z
[SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite
DataSourceWithHiveMetastoreCatalogSuite
## What changes were proposed in this pull request?
The Spark 2.3 branch still failed due to the flaky test suite
`DataSourceWithHiveMetastoreCatalogSuite`.
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
Although https://github.com/apache/spark/pull/20207 was unable to reproduce
it in Spark 2.3, the stack trace below suggests that the current database of
Spark's Catalog has been changed. Thus, we just need to reset it.
```
[info] DataSourceWithHiveMetastoreCatalogSuite:
02:40:39.486 ERROR org.apache.hadoop.hive.ql.parse.CalcitePlanner: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:14 Table not found 't'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1594)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1545)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10077)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:694)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:683)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:683)
at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:673)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:185)
at org.apache.spark.sql.test.SQLTestUtilsBase$class.withTable(SQLTestUtils.scala:273)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:139)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:163)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #20218 from gatorsmile/testFixAgain.
(cherry picked from commit 76892bcf2c08efd7e9c5b16d377e623d82fe695e)
Signed-off-by: gatorsmile <[email protected]>
commit f624850fe8acce52240217f376316734a23be00b
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:33:42Z
[SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and
fillna
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18164 introduced these behavior changes;
this PR documents them.
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #20234 from gatorsmile/docBehaviorChange.
(cherry picked from commit b46e58b74c82dac37b7b92284ea3714919c5a886)
Signed-off-by: hyukjinkwon <[email protected]>
commit b94debd2b01b87ef1d2a34d48877e38ade0969e6
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-11T18:37:35Z
[SPARK-22994][K8S] Use a single image for all Spark containers.
This change allows a user to submit a Spark application on Kubernetes
by providing a single image, instead of one image for each type
of container. The image's entry point now takes an extra argument that
identifies the process that is being started.
The configuration still allows the user to provide different images
for each container type if they so desire.
On top of that, the entry point was simplified a bit to share more
code; mainly, the same env variable is used to propagate the user-defined
classpath to the different containers.
Aside from being modified to match the new behavior, the
'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh'
to more closely match its purpose; the old name was a little awkward
and now also not entirely correct, since there is a single image. It
was also moved to 'bin' since it's not necessarily an admin tool.
Docs have been updated to match the new behavior.
Tested locally with minikube.
Author: Marcelo Vanzin <[email protected]>
Closes #20192 from vanzin/SPARK-22994.
(cherry picked from commit 0b2eefb674151a0af64806728b38d9410da552ec)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit f891ee3249e04576dd579cbab6f8f1632550e6bd
Author: Jose Torres <jose@...>
Date: 2018-01-11T18:52:12Z
[SPARK-22908] Add kafka source and sink for continuous processing.
## What changes were proposed in this pull request?
Add kafka source and sink for continuous processing. This involves two
small changes to the execution engine:
* Bring data reader close() into the normal data reader thread to avoid
thread safety issues.
* Fix up the semantics of the RECONFIGURING StreamExecution state. State
updates are now atomic, and we don't have to deal with swallowing an exception.
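A rough sketch of wiring the new Kafka source and sink to a continuous trigger
(broker addresses, topic names, and the checkpoint path below are placeholders):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-kafka").getOrCreate()

// Read from one Kafka topic and echo the values into another, using the
// continuous processing trigger introduced in Spark 2.3.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "in-topic")
  .load()

val query = input
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/checkpoints/continuous-kafka")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```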
## How was this patch tested?
new unit tests
Author: Jose Torres <[email protected]>
Closes #20096 from jose-torres/continuous-kafka.
(cherry picked from commit 6f7aaed805070d29dcba32e04ca7a1f581fa54b9)
Signed-off-by: Tathagata Das <[email protected]>
commit 2ec302658c98038962c9b7a90fd2cff751a35ffa
Author: Bago Amirbekian <bago@...>
Date: 2018-01-11T21:57:15Z
[SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline
## What changes were proposed in this pull request?
Including VectorSizeHint in RFormula pipelines will allow them to be
applied to streaming dataframes.
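For context, a sketch of what a standalone VectorSizeHint stage looks like; RFormula now
adds an equivalent stage for the vector columns it assembles (the column name and size
below are placeholders):
```scala
import org.apache.spark.ml.feature.VectorSizeHint

// Declares that the "features" vector column always has 3 elements, which lets
// downstream stages such as VectorAssembler operate on streaming DataFrames,
// where the vector size cannot be inferred from the data.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(3)
  .setHandleInvalid("error")
```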
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <[email protected]>
Closes #20238 from MrBago/rFormulaVectorSize.
(cherry picked from commit 186bf8fb2e9ff8a80f3f6bcb5f2a0327fa79a1c9)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:10Z
Preparing Spark release v2.3.0-rc1
commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:17Z
Preparing development version 2.3.1-SNAPSHOT
commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T00:20:30Z
[SPARK-23008][ML] OneHotEncoderEstimator Python API
## What changes were proposed in this pull request?
OneHotEncoderEstimator Python API.
## How was this patch tested?
doctest
Author: WeichenXu <[email protected]>
Closes #20209 from WeichenXu123/ohe_py.
(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj <ho3rexqj@...>
Date: 2018-01-12T07:27:00Z
[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances
of broadcast variable values
When resources happen to be constrained on an executor, the first time a
broadcast variable is instantiated it is persisted to disk by the BlockManager.
Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock
from other instances of that broadcast variable spawns another instance of the
underlying value. That is, broadcast variables are instantiated once per
executor **unless** memory is constrained, in which case every instance of a
broadcast variable is provided with a unique copy of the underlying value.
This patch fixes the above by explicitly caching the underlying values
using weak references in a ReferenceMap.
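To illustrate the idea (this is a simplified sketch, not the exact Spark code), a cache of
deserialized broadcast values behind weak references lets repeated reads on an executor
reuse a single instance while still allowing the value to be garbage collected:
```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Simplified per-executor cache keyed by broadcast id. Values are held weakly,
// so an otherwise unreferenced broadcast value can still be collected.
object BroadcastValueCache {
  private val cache = mutable.HashMap.empty[Long, WeakReference[AnyRef]]

  def getOrLoad(id: Long)(load: => AnyRef): AnyRef = synchronized {
    cache.get(id).flatMap(ref => Option(ref.get())) match {
      case Some(value) => value                        // reuse the cached instance
      case None =>
        val value = load                               // e.g. read blocks from the BlockManager
        cache(id) = new WeakReference[AnyRef](value)
        value
    }
  }
}
```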
Author: ho3rexqj <[email protected]>
Closes #20183 from ho3rexqj/fix/cache-broadcast-values.
(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan <[email protected]>
commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T09:27:02Z
[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated
## What changes were proposed in this pull request?
mark OneHotEncoder python API deprecated
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #20241 from WeichenXu123/mark_ohe_deprecated.
(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath <[email protected]>
commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T10:04:44Z
[SPARK-23025][SQL] Support Null type in scala reflection
## What changes were proposed in this pull request?
Add support for `Null` type in the `schemaFor` method for Scala reflection.
## How was this patch tested?
Added UT
Author: Marco Gaido <[email protected]>
Closes #20219 from mgaido91/SPARK-23025.
(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile <[email protected]>
commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-12T18:18:42Z
[MINOR][BUILD] Fix Java linter errors
## What changes were proposed in this pull request?
This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully,
this will be the final one.
```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR]
src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85]
(sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR]
src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8]
(imports) UnusedImports: Unused import - java.io.IOException.
[ERROR]
src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9]
(modifier) ModifierOrder: 'private' modifier out of order with the JLS
suggestions.
[ERROR]
src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java:[464] (sizes)
LineLength: Line is longer than 100 characters (found 102).
```
## How was this patch tested?
Manual.
```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <[email protected]>
Closes #20242 from dongjoon-hyun/fix_lint_java_2.3_rc1.
(cherry picked from commit 7bd14cfd40500a0b6462cda647bdbb686a430328)
Signed-off-by: Sameer Agarwal <[email protected]>
commit 02176f4c2f60342068669b215485ffd443346aed
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T19:25:37Z
[SPARK-22975][SS] MetricsReporter should not throw exception when there was
no progress reported
## What changes were proposed in this pull request?
`MetricsReporter` assumes that there has been some progress for the query,
i.e. `lastProgress` is not null. If this is not true, as it might happen in
particular conditions, a `NullPointerException` can be thrown.
The PR checks whether there is a `lastProgress` and if this is not true, it
returns a default value for the metrics.
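A minimal sketch of the defensive pattern described above (the metric and default value
are illustrative, not the exact gauges registered by MetricsReporter):
```scala
import org.apache.spark.sql.streaming.StreamingQueryProgress

// Fall back to a default instead of dereferencing a null lastProgress when the
// query has not reported any progress yet.
def processedRowsPerSecond(lastProgress: StreamingQueryProgress): Double =
  Option(lastProgress).map(_.processedRowsPerSecond).getOrElse(0.0)
```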
## How was this patch tested?
added UT
Author: Marco Gaido <[email protected]>
Closes #20189 from mgaido91/SPARK-22975.
(cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 60bcb4685022c29a6ddcf707b505369687ec7da6
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-12T23:07:14Z
Revert "[SPARK-22908] Add kafka source and sink for continuous processing."
This reverts commit f891ee3249e04576dd579cbab6f8f1632550e6bd.
commit ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-13T07:13:44Z
[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each
batch within scalar Pandas UDF
## What changes were proposed in this pull request?
This PR proposes to add a note saying that the length of a scalar Pandas
UDF's `Series` is that of the batch, not of the whole input column.
This is fine for a grouped map UDF because its usage differs from our
typical UDF, but scalar UDFs might cause confusion with normal UDFs.
For example, please consider this example:
```python
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import LongType
df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 1|
+------------------+
```
```python
from pyspark.sql.functions import udf, col, lit
df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 4|
+------------------+
```
## How was this patch tested?
Manually built the doc and checked the output.
Author: hyukjinkwon <[email protected]>
Closes #20237 from HyukjinKwon/SPARK-22980.
(cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
Signed-off-by: hyukjinkwon <[email protected]>
commit 801ffd799922e1c2751d3331874b88a67da8cf01
Author: Yuming Wang <yumwang@...>
Date: 2018-01-13T16:01:44Z
[SPARK-22870][CORE] Dynamic allocation should allow 0 idle time
## What changes were proposed in this pull request?
This PR makes `0` a valid value for
`spark.dynamicAllocation.executorIdleTimeout`.
For details, see the jira description:
https://issues.apache.org/jira/browse/SPARK-22870.
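For example, a configuration like the following becomes valid after this change
(a sketch; it assumes dynamic allocation is otherwise set up as usual):
```scala
import org.apache.spark.SparkConf

// An idle timeout of 0 is now accepted, meaning idle executors can be
// released immediately.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "0s")
```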
## How was this patch tested?
N/A
Author: Yuming Wang <[email protected]>
Author: Yuming Wang <[email protected]>
Closes #20080 from wangyum/SPARK-22870.
(cherry picked from commit fc6fe8a1d0f161c4788f3db94de49a8669ba3bcc)
Signed-off-by: Sean Owen <[email protected]>
commit 8d32ed5f281317ba380aa6b8b3f3f041575022cb
Author: xubo245 <601450868@...>
Date: 2018-01-13T18:28:57Z
[SPARK-23036][SQL][TEST] Add withGlobalTempView for testing
## What changes were proposed in this pull request?
Add withGlobalTempView for creating global temp views in tests, analogous to
the existing withTempView and withView helpers, and correct some improper
usage. Please see the JIRA for details.
There are other similar places; I will fix them if the community needs it.
Please confirm.
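A sketch of what such a helper typically looks like, modeled on the existing withTempView
pattern (the exact signature in SQLTestUtils may differ):
```scala
import org.apache.spark.sql.SparkSession

// Runs the test body and always drops the named global temp views afterwards,
// even if the body throws.
def withGlobalTempView(spark: SparkSession)(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
    viewNames.foreach(spark.catalog.dropGlobalTempView)
  }
}
```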
## How was this patch tested?
no new test.
Author: xubo245 <[email protected]>
Closes #20228 from xubo245/DropTempView.
(cherry picked from commit bd4a21b4820c4ebaf750131574a6b2eeea36907e)
Signed-off-by: gatorsmile <[email protected]>
commit 0fc5533e53ad03eb67590ddd231f40c2713150c3
Author: CodingCat <zhunansjtu@...>
Date: 2018-01-13T18:36:32Z
[SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's
size
## What changes were proposed in this pull request?
As per the discussion in
https://github.com/apache/spark/pull/19864#discussion_r156847927,
the current HadoopFsRelation size estimate is based purely on the underlying
file size, which is not accurate and makes execution vulnerable to errors
such as OOM.
Users can enable CBO with the functionalities in
https://github.com/apache/spark/pull/19864 to avoid this issue.
This JIRA proposes to add a configurable factor to the sizeInBytes method in
the HadoopFsRelation class so that users can mitigate this problem without CBO.
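A sketch of how such a factor would be used, assuming the configuration key is something
like `spark.sql.sources.fileCompressionFactor` and an active SparkSession named `spark`
(both are assumptions, not confirmed by this excerpt):
```scala
// Tell the planner that on-disk file sizes under-estimate the in-memory size
// by roughly 3x (e.g. for heavily compressed columnar files), so decisions
// such as broadcast joins are based on a more realistic estimate.
spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")

val onDiskBytes = 100L * 1024 * 1024                    // raw file size on disk
val estimatedSizeInBytes = (onDiskBytes * 3.0).toLong   // what sizeInBytes would report
```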
## How was this patch tested?
Existing tests
Author: CodingCat <[email protected]>
Author: Nan Zhu <[email protected]>
Closes #20072 from CodingCat/SPARK-22790.
(cherry picked from commit ba891ec993c616dc4249fc786c56ea82ed04a827)
Signed-off-by: gatorsmile <[email protected]>
commit bcd87ae0775d16b7c3b9de0c4f2db36eb3679476
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-13T21:39:38Z
[SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in
compareAndGetNewStats
## What changes were proposed in this pull request?
This pr fixed code to compare values in `compareAndGetNewStats`.
The test below fails in the current master;
```
val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) *
2)
val newStats5 = CommandUtils.compareAndGetNewStats(
Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
assert(newStats5.isEmpty)
```
## How was this patch tested?
Added some tests in `CommandUtilsSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20245 from maropu/SPARK-21213-FOLLOWUP.
(cherry picked from commit 0066d6f6fa604817468471832968d4339f71c5cb)
Signed-off-by: gatorsmile <[email protected]>
commit 1f4a08b15ab47cf6c3bb08c783497422f30d0709
Author: foxish <ramanathana@...>
Date: 2018-01-14T05:34:28Z
[SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of
other misses)
## What changes were proposed in this pull request?
Including the `-Pkubernetes` flag in a few places where it was missed.
## How was this patch tested?
checkstyle, mima through manual tests.
Author: foxish <[email protected]>
Closes #20256 from foxish/SPARK-23063.
(cherry picked from commit c3548d11c3c57e8f2c6ebd9d2d6a3924ddcd3cba)
Signed-off-by: Felix Cheung <[email protected]>
commit a335a49ce4672b44e5f818145214040a67c722ba
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-14T07:26:12Z
[SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
## What changes were proposed in this pull request?
This PR aims to update the following in `docker/spark-test`.
- JDK7 -> JDK8
Spark 2.2+ supports JDK8 only.
- Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial)
The end of life of `precise` was April 28, 2017.
## How was this patch tested?
Manual.
* Master
```
$ cd external/docker
$ ./build
$ export SPARK_HOME=...
$ docker run -v $SPARK_HOME:/opt/spark spark-test-master
CONTAINER_IP=172.17.0.3
...
18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and
started at http://172.17.0.3:8080
18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for
submitting applications on port 6066
18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
```
* Slave
```
$ docker run -v $SPARK_HOME:/opt/spark spark-test-worker
spark://172.17.0.3:7077
CONTAINER_IP=172.17.0.4
...
18/01/11 06:51:54 INFO Worker: Successfully registered with master
spark://172.17.0.3:7077
```
After slave starts, master will show
```
18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4
cores, 1024.0 MB RAM
```
Author: Dongjoon Hyun <[email protected]>
Closes #20230 from dongjoon-hyun/SPARK-23038.
(cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d)
Signed-off-by: Felix Cheung <[email protected]>
commit 0d425c3362dc648d5c85b2b07af1a9df23b6d422
Author: Felix Cheung <felixcheung_m@...>
Date: 2018-01-14T10:43:10Z
[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text
## What changes were proposed in this pull request?
fix truncated doc text
## How was this patch tested?
manually
Author: Felix Cheung <[email protected]>
Closes #20263 from felixcheung/r23docfix.
(cherry picked from commit 66738d29c59871b29d26fc3756772b95ef536248)
Signed-off-by: hyukjinkwon <[email protected]>
commit 5fbbd94d509dbbcfa1fe940569049f72ff4a6e89
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-14T14:26:21Z
[SPARK-23021][SQL] AnalysisBarrier should override innerChildren to print
correct explain output
## What changes were proposed in this pull request?
`AnalysisBarrier` in the current master cuts off explain results for parsed
logical plans;
```
scala> Seq((1, 1)).toDF("a",
"b").groupBy("a").count().sample(0.1).explain(true)
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -7661439431999668039
+- AnalysisBarrier Aggregate [a#5], [a#5, count(1) AS count#14L]
```
To fix this, `AnalysisBarrier` needs to override `innerChildren` and this
pr changed the output to;
```
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -5086223488015741426
+- AnalysisBarrier
+- Aggregate [a#5], [a#5, count(1) AS count#14L]
+- Project [_1#2 AS a#5, _2#3 AS b#6]
+- LocalRelation [_1#2, _2#3]
```
## How was this patch tested?
Added tests in `DataFrameSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20247 from maropu/SPARK-23021-2.
(cherry picked from commit 990f05c80347c6eec2ee06823cff587c9ea90b49)
Signed-off-by: gatorsmile <[email protected]>
commit 9051e1a265dc0f1dc19fd27a0127ffa47f3ac245
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-14T14:32:35Z
[SPARK-23051][CORE] Fix for broken job description in Spark UI
## What changes were proposed in this pull request?
In 2.2, the Spark UI displayed the stage description if the job description
was not set. This functionality was broken: the UI showed no description in
this case. In addition, the code used jobName and jobDescription instead of
stageName and stageDescription when JobTableRowData was created.
In this PR, the logic producing values for the job rows was modified to find
the latest stage attempt for the job and use that as a fallback if the job
description was missing. StageName and stageDescription are now set using
values from the stage, and jobName/description is used only as a fallback.
## How was this patch tested?
Manual testing of the UI, using the code in the bug report.
Author: Sandor Murakozi <[email protected]>
Closes #20251 from smurakozi/SPARK-23051.
(cherry picked from commit 60eeecd7760aee6ce2fd207c83ae40054eadaf83)
Signed-off-by: Sean Owen <[email protected]>
commit 2879236b92b5712b7438b972404375bbf1993df8
Author: guoxiaolong <guo.xiaolong1@...>
Date: 2018-01-14T18:02:49Z
[SPARK-22999][SQL] 'show databases like' command can remove the like keyword
## What changes were proposed in this pull request?
The grammar is SHOW DATABASES (LIKE pattern = STRING)?; this change makes the
LIKE keyword optional.
For reference, the SHOW TABLES command already accepts both SHOW TABLES
'test*' and SHOW TABLES LIKE 'test*'.
Similarly, SHOW DATABASES 'test*' and SHOW DATABASES LIKE 'test*' can both be
used.
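A quick sketch of both forms from Scala (database names are placeholders; assumes an
active SparkSession named `spark`):
```scala
// With this change, both forms are accepted; the LIKE keyword is optional.
spark.sql("SHOW DATABASES LIKE 'test*'").show()
spark.sql("SHOW DATABASES 'test*'").show()
```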
## How was this patch tested?
Unit tests and manual tests.
Author: guoxiaolong <[email protected]>
Closes #20194 from guoxiaolongzte/SPARK-22999.
(cherry picked from commit 42a1a15d739890bdfbb367ef94198b19e98ffcb7)
Signed-off-by: gatorsmile <[email protected]>
commit 30574fd3716dbdf553cfd0f4d33164ab8fbccb77
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T02:55:21Z
[SPARK-23054][SQL] Fix incorrect results of casting UserDefinedType to
String
## What changes were proposed in this pull request?
This pr fixed the issue when casting `UserDefinedType`s into strings;
```
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0,
Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0] |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
```
The root cause is that `Cast` handles the input data as
`UserDefinedType.sqlType` (the underlying storage type), so we should pass
the data through `UserDefinedType.deserialize` and then call `toString`.
This PR modifies the result to:
```
+---------+
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+
```
## How was this patch tested?
Added tests in `UserDefinedTypeSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20246 from maropu/SPARK-23054.
(cherry picked from commit b98ffa4d6dabaf787177d3f14b200fc4b118c7ce)
Signed-off-by: Wenchen Fan <[email protected]>
commit 81b989903af0cdcb6c19e6e8e7bdbac455a2c281
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-15T04:06:56Z
[SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` should work for ORC
files
## What changes were proposed in this pull request?
When `spark.sql.files.ignoreCorruptFiles=true`, we should ignore corrupted
ORC files.
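A minimal sketch of the behavior being tested (the path is a placeholder; assumes an
active SparkSession named `spark`):
```scala
// With ignoreCorruptFiles enabled, corrupted ORC files in the input directory
// are skipped instead of failing the whole read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.orc("/path/to/orc/dir")
df.count()
```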
## How was this patch tested?
Pass the Jenkins with a newly added test case.
Author: Dongjoon Hyun <[email protected]>
Closes #20240 from dongjoon-hyun/SPARK-23049.
(cherry picked from commit 9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43)
Signed-off-by: Wenchen Fan <[email protected]>
commit 188999a3401357399d8d2b30f440d8b0b0795fc5
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T08:26:52Z
[SPARK-23023][SQL] Cast field data to strings in showString
## What changes were proposed in this pull request?
The current `Dataset.showString` prints rows through `RowEncoder`
deserializers, like this:
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect; the correct one is:
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this PR fixes the code in `showString` to cast field data to strings
before printing.
## How was this patch tested?
Added tests in `DataFrameSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20214 from maropu/SPARK-23023.
(cherry picked from commit b59808385cfe24ce768e5b3098b9034e64b99a5a)
Signed-off-by: Wenchen Fan <[email protected]>
commit 3491ca4fb5c2e3fecd727f7a31b8efbe74032bcc
Author: Yuming Wang <yumwang@...>
Date: 2018-01-15T13:49:34Z
[SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module
## What changes were proposed in this pull request?
Remove `MaxPermSize` for `sql` module
## How was this patch tested?
Manually tested.
Author: Yuming Wang <[email protected]>
Closes #20268 from wangyum/SPARK-19550-MaxPermSize.
(cherry picked from commit a38c887ac093d7cf343d807515147d87ca931ce7)
Signed-off-by: Sean Owen <[email protected]>
commit c6a3b9297f0246cfc02a57ec099ca23db90f343f
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-15T14:32:38Z
[SPARK-23070] Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
## What changes were proposed in this pull request?
Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 and add the
missing exclusions to `v23excludes` in `MimaExcludes`. No item can be
un-excluded in `v23excludes`.
## How was this patch tested?
The existing tests.
Author: gatorsmile <[email protected]>
Closes #20264 from gatorsmile/bump22.
(cherry picked from commit bd08a9e7af4137bddca638e627ad2ae531bce20f)
Signed-off-by: gatorsmile <[email protected]>
----