GitHub user tengpeng opened a pull request:
https://github.com/apache/spark/pull/20729
[SPARK-23578][ML]Add multicolumn support for Binarizer
[SPARK-20542] added an API to Bucketizer that can bin multiple columns.
Based on that change, this PR adds multi-column support to Binarizer.
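For reference, a minimal sketch of how the multi-column usage could look, assuming the
new API mirrors Bucketizer's setInputCols/setOutputCols and adds per-column thresholds
(the method names below are illustrative, not confirmed by this PR):
```scala
import org.apache.spark.ml.feature.Binarizer

// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq((0.1, 5.0), (0.8, 1.0)).toDF("f1", "f2")

// Hypothetical multi-column API, mirroring Bucketizer's multi-column support.
val binarizer = new Binarizer()
  .setInputCols(Array("f1", "f2"))          // assumed multi-column setter
  .setOutputCols(Array("f1_bin", "f2_bin"))
  .setThresholds(Array(0.5, 2.0))           // assumed per-column thresholds

binarizer.transform(df).show()
```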
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tengpeng/spark Binarizer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20729.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20729
----
commit 9ca0f6eaf6744c090cab4ac6720cf11c9b83915e
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:32:36Z
[SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite
DataSourceWithHiveMetastoreCatalogSuite
## What changes were proposed in this pull request?
The Spark 2.3 branch still failed due to the flaky test suite
`DataSourceWithHiveMetastoreCatalogSuite`.
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
Although https://github.com/apache/spark/pull/20207 was unable to reproduce
it in Spark 2.3, the stack trace below suggests that the current database of
Spark's Catalog has been changed. Thus, we just need to reset it.
```
[info] DataSourceWithHiveMetastoreCatalogSuite:
02:40:39.486 ERROR org.apache.hadoop.hive.ql.parse.CalcitePlanner: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:14 Table not found 't'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1594)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1545)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10077)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:694)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:683)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:683)
at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:673)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:185)
at org.apache.spark.sql.test.SQLTestUtilsBase$class.withTable(SQLTestUtils.scala:273)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:139)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:163)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #20218 from gatorsmile/testFixAgain.
(cherry picked from commit 76892bcf2c08efd7e9c5b16d377e623d82fe695e)
Signed-off-by: gatorsmile <[email protected]>
commit f624850fe8acce52240217f376316734a23be00b
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:33:42Z
[SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and
fillna
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18164 introduced these behavior changes;
this PR documents them.
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #20234 from gatorsmile/docBehaviorChange.
(cherry picked from commit b46e58b74c82dac37b7b92284ea3714919c5a886)
Signed-off-by: hyukjinkwon <[email protected]>
commit b94debd2b01b87ef1d2a34d48877e38ade0969e6
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-11T18:37:35Z
[SPARK-22994][K8S] Use a single image for all Spark containers.
This change allows a user to submit a Spark application on Kubernetes
by providing a single image, instead of one image for each type
of container. The image's entry point now takes an extra argument that
identifies the process that is being started.
The configuration still allows the user to provide different images
for each container type if they so desire.
On top of that, the entry point was simplified a bit to share more
code; mainly, the same env variable is used to propagate the user-defined
classpath to the different containers.
Aside from being modified to match the new behavior, the
'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh'
to more closely match its purpose; the old name was a little awkward
and now also not entirely correct, since there is a single image. It
was also moved to 'bin' since it's not necessarily an admin tool.
Docs have been updated to match the new behavior.
Tested locally with minikube.
Author: Marcelo Vanzin <[email protected]>
Closes #20192 from vanzin/SPARK-22994.
(cherry picked from commit 0b2eefb674151a0af64806728b38d9410da552ec)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit f891ee3249e04576dd579cbab6f8f1632550e6bd
Author: Jose Torres <jose@...>
Date: 2018-01-11T18:52:12Z
[SPARK-22908] Add kafka source and sink for continuous processing.
## What changes were proposed in this pull request?
Add kafka source and sink for continuous processing. This involves two
small changes to the execution engine:
* Bring data reader close() into the normal data reader thread to avoid
thread safety issues.
* Fix up the semantics of the RECONFIGURING StreamExecution state. State
updates are now atomic, and we don't have to deal with swallowing an exception.
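A rough sketch of wiring the new Kafka source and sink to a continuous trigger
(broker addresses, topic names, and the checkpoint path below are placeholders):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-kafka").getOrCreate()

// Read from one Kafka topic and echo the values into another, using the
// continuous processing trigger introduced in Spark 2.3.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "in-topic")
  .load()

val query = input
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/checkpoints/continuous-kafka")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```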
## How was this patch tested?
new unit tests
Author: Jose Torres <[email protected]>
Closes #20096 from jose-torres/continuous-kafka.
(cherry picked from commit 6f7aaed805070d29dcba32e04ca7a1f581fa54b9)
Signed-off-by: Tathagata Das <[email protected]>
commit 2ec302658c98038962c9b7a90fd2cff751a35ffa
Author: Bago Amirbekian <bago@...>
Date: 2018-01-11T21:57:15Z
[SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline
## What changes were proposed in this pull request?
Including VectorSizeHint in RFormula pipelines will allow them to be
applied to streaming dataframes.
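For context, a sketch of what a standalone VectorSizeHint stage looks like; RFormula now
adds an equivalent stage for the vector columns it assembles (the column name and size
below are placeholders):
```scala
import org.apache.spark.ml.feature.VectorSizeHint

// Declares that the "features" vector column always has 3 elements, which lets
// downstream stages such as VectorAssembler operate on streaming DataFrames,
// where the vector size cannot be inferred from the data.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(3)
  .setHandleInvalid("error")
```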
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <[email protected]>
Closes #20238 from MrBago/rFormulaVectorSize.
(cherry picked from commit 186bf8fb2e9ff8a80f3f6bcb5f2a0327fa79a1c9)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:10Z
Preparing Spark release v2.3.0-rc1
commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:17Z
Preparing development version 2.3.1-SNAPSHOT
commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T00:20:30Z
[SPARK-23008][ML] OneHotEncoderEstimator Python API
## What changes were proposed in this pull request?
OneHotEncoderEstimator Python API.
## How was this patch tested?
doctest
Author: WeichenXu <[email protected]>
Closes #20209 from WeichenXu123/ohe_py.
(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj <ho3rexqj@...>
Date: 2018-01-12T07:27:00Z
[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances
of broadcast variable values
When resources happen to be constrained on an executor, the first time a
broadcast variable is instantiated it is persisted to disk by the BlockManager.
Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock
from other instances of that broadcast variable spawns another instance of the
underlying value. That is, broadcast variables are instantiated once per
executor **unless** memory is constrained, in which case every instance of a
broadcast variable is provided with a unique copy of the underlying value.
This patch fixes the above by explicitly caching the underlying values
using weak references in a ReferenceMap.
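To illustrate the idea (this is a simplified sketch, not the exact Spark code), a cache of
deserialized broadcast values behind weak references lets repeated reads on an executor
reuse a single instance while still allowing the value to be garbage collected:
```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Simplified per-executor cache keyed by broadcast id. Values are held weakly,
// so an otherwise unreferenced broadcast value can still be collected.
object BroadcastValueCache {
  private val cache = mutable.HashMap.empty[Long, WeakReference[AnyRef]]

  def getOrLoad(id: Long)(load: => AnyRef): AnyRef = synchronized {
    cache.get(id).flatMap(ref => Option(ref.get())) match {
      case Some(value) => value                        // reuse the cached instance
      case None =>
        val value = load                               // e.g. read blocks from the BlockManager
        cache(id) = new WeakReference[AnyRef](value)
        value
    }
  }
}
```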
Author: ho3rexqj <[email protected]>
Closes #20183 from ho3rexqj/fix/cache-broadcast-values.
(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan <[email protected]>
commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T09:27:02Z
[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated
## What changes were proposed in this pull request?
mark OneHotEncoder python API deprecated
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #20241 from WeichenXu123/mark_ohe_deprecated.
(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath <[email protected]>
commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T10:04:44Z
[SPARK-23025][SQL] Support Null type in scala reflection
## What changes were proposed in this pull request?
Add support for `Null` type in the `schemaFor` method for Scala reflection.
## How was this patch tested?
Added UT
Author: Marco Gaido <[email protected]>
Closes #20219 from mgaido91/SPARK-23025.
(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile <[email protected]>
commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-12T18:18:42Z
[MINOR][BUILD] Fix Java linter errors
## What changes were proposed in this pull request?
This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully,
this will be the final one.
```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR]
src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85]
(sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR]
src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8]
(imports) UnusedImports: Unused import - java.io.IOException.
[ERROR]
src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9]
(modifier) ModifierOrder: 'private' modifier out of order with the JLS
suggestions.
[ERROR]
src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java:[464] (sizes)
LineLength: Line is longer than 100 characters (found 102).
```
## How was this patch tested?
Manual.
```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <[email protected]>
Closes #20242 from dongjoon-hyun/fix_lint_java_2.3_rc1.
(cherry picked from commit 7bd14cfd40500a0b6462cda647bdbb686a430328)
Signed-off-by: Sameer Agarwal <[email protected]>
commit 02176f4c2f60342068669b215485ffd443346aed
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T19:25:37Z
[SPARK-22975][SS] MetricsReporter should not throw exception when there was
no progress reported
## What changes were proposed in this pull request?
`MetricsReporter` assumes that there has been some progress for the query,
i.e. `lastProgress` is not null. If this is not true, as it might happen in
particular conditions, a `NullPointerException` can be thrown.
The PR checks whether there is a `lastProgress` and if this is not true, it
returns a default value for the metrics.
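A minimal sketch of the defensive pattern described above (the metric and default value
are illustrative, not the exact gauges registered by MetricsReporter):
```scala
import org.apache.spark.sql.streaming.StreamingQueryProgress

// Fall back to a default instead of dereferencing a null lastProgress when the
// query has not reported any progress yet.
def processedRowsPerSecond(lastProgress: StreamingQueryProgress): Double =
  Option(lastProgress).map(_.processedRowsPerSecond).getOrElse(0.0)
```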
## How was this patch tested?
added UT
Author: Marco Gaido <[email protected]>
Closes #20189 from mgaido91/SPARK-22975.
(cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 60bcb4685022c29a6ddcf707b505369687ec7da6
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-12T23:07:14Z
Revert "[SPARK-22908] Add kafka source and sink for continuous processing."
This reverts commit f891ee3249e04576dd579cbab6f8f1632550e6bd.
commit ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-13T07:13:44Z
[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each
batch within scalar Pandas UDF
## What changes were proposed in this pull request?
This PR proposes to add a note saying that the length of a scalar Pandas
UDF's `Series` is that of the batch, not of the whole input column.
This is fine for a grouped map UDF because its usage differs from our
typical UDF, but scalar UDFs might cause confusion with normal UDFs.
For example, please consider this example:
```python
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import LongType
df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 1|
+------------------+
```
```python
from pyspark.sql.functions import udf, col, lit
df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 4|
+------------------+
```
## How was this patch tested?
Manually built the doc and checked the output.
Author: hyukjinkwon <[email protected]>
Closes #20237 from HyukjinKwon/SPARK-22980.
(cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
Signed-off-by: hyukjinkwon <[email protected]>
commit 801ffd799922e1c2751d3331874b88a67da8cf01
Author: Yuming Wang <yumwang@...>
Date: 2018-01-13T16:01:44Z
[SPARK-22870][CORE] Dynamic allocation should allow 0 idle time
## What changes were proposed in this pull request?
This PR makes `0` a valid value for
`spark.dynamicAllocation.executorIdleTimeout`.
For details, see the jira description:
https://issues.apache.org/jira/browse/SPARK-22870.
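For example, a configuration like the following becomes valid after this change
(a sketch; it assumes dynamic allocation is otherwise set up as usual):
```scala
import org.apache.spark.SparkConf

// An idle timeout of 0 is now accepted, meaning idle executors can be
// released immediately.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "0s")
```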
## How was this patch tested?
N/A
Author: Yuming Wang <[email protected]>
Author: Yuming Wang <[email protected]>
Closes #20080 from wangyum/SPARK-22870.
(cherry picked from commit fc6fe8a1d0f161c4788f3db94de49a8669ba3bcc)
Signed-off-by: Sean Owen <[email protected]>
commit 8d32ed5f281317ba380aa6b8b3f3f041575022cb
Author: xubo245 <601450868@...>
Date: 2018-01-13T18:28:57Z
[SPARK-23036][SQL][TEST] Add withGlobalTempView for testing
## What changes were proposed in this pull request?
Add withGlobalTempView for creating global temp views in tests, analogous to
the existing withTempView and withView helpers, and correct some improper
usage. Please see the JIRA for details.
There are other similar places; I will fix them if the community needs it.
Please confirm.
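A sketch of what such a helper typically looks like, modeled on the existing withTempView
pattern (the exact signature in SQLTestUtils may differ):
```scala
import org.apache.spark.sql.SparkSession

// Runs the test body and always drops the named global temp views afterwards,
// even if the body throws.
def withGlobalTempView(spark: SparkSession)(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
    viewNames.foreach(spark.catalog.dropGlobalTempView)
  }
}
```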
## How was this patch tested?
no new test.
Author: xubo245 <[email protected]>
Closes #20228 from xubo245/DropTempView.
(cherry picked from commit bd4a21b4820c4ebaf750131574a6b2eeea36907e)
Signed-off-by: gatorsmile <[email protected]>
commit 0fc5533e53ad03eb67590ddd231f40c2713150c3
Author: CodingCat <zhunansjtu@...>
Date: 2018-01-13T18:36:32Z
[SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's
size
## What changes were proposed in this pull request?
As per the discussion in
https://github.com/apache/spark/pull/19864#discussion_r156847927,
the current HadoopFsRelation size estimate is based purely on the underlying
file size, which is not accurate and makes execution vulnerable to errors
such as OOM.
Users can enable CBO with the functionalities in
https://github.com/apache/spark/pull/19864 to avoid this issue.
This JIRA proposes to add a configurable factor to the sizeInBytes method in
the HadoopFsRelation class so that users can mitigate this problem without CBO.
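A sketch of how such a factor would be used, assuming the configuration key is something
like `spark.sql.sources.fileCompressionFactor` and an active SparkSession named `spark`
(both are assumptions, not confirmed by this excerpt):
```scala
// Tell the planner that on-disk file sizes under-estimate the in-memory size
// by roughly 3x (e.g. for heavily compressed columnar files), so decisions
// such as broadcast joins are based on a more realistic estimate.
spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")

val onDiskBytes = 100L * 1024 * 1024                    // raw file size on disk
val estimatedSizeInBytes = (onDiskBytes * 3.0).toLong   // what sizeInBytes would report
```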
## How was this patch tested?
Existing tests
Author: CodingCat <[email protected]>
Author: Nan Zhu <[email protected]>
Closes #20072 from CodingCat/SPARK-22790.
(cherry picked from commit ba891ec993c616dc4249fc786c56ea82ed04a827)
Signed-off-by: gatorsmile <[email protected]>
commit bcd87ae0775d16b7c3b9de0c4f2db36eb3679476
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-13T21:39:38Z
[SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in
compareAndGetNewStats
## What changes were proposed in this pull request?
This pr fixed code to compare values in `compareAndGetNewStats`.
The test below fails in the current master;
```
val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) *
2)
val newStats5 = CommandUtils.compareAndGetNewStats(
Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
assert(newStats5.isEmpty)
```
## How was this patch tested?
Added some tests in `CommandUtilsSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20245 from maropu/SPARK-21213-FOLLOWUP.
(cherry picked from commit 0066d6f6fa604817468471832968d4339f71c5cb)
Signed-off-by: gatorsmile <[email protected]>
commit 1f4a08b15ab47cf6c3bb08c783497422f30d0709
Author: foxish <ramanathana@...>
Date: 2018-01-14T05:34:28Z
[SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of
other misses)
## What changes were proposed in this pull request?
Including the `-Pkubernetes` flag in a few places where it was missed.
## How was this patch tested?
checkstyle, mima through manual tests.
Author: foxish <[email protected]>
Closes #20256 from foxish/SPARK-23063.
(cherry picked from commit c3548d11c3c57e8f2c6ebd9d2d6a3924ddcd3cba)
Signed-off-by: Felix Cheung <[email protected]>
commit a335a49ce4672b44e5f818145214040a67c722ba
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-14T07:26:12Z
[SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
## What changes were proposed in this pull request?
This PR aims to update the following in `docker/spark-test`.
- JDK7 -> JDK8
Spark 2.2+ supports JDK8 only.
- Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial)
The end of life of `precise` was April 28, 2017.
## How was this patch tested?
Manual.
* Master
```
$ cd external/docker
$ ./build
$ export SPARK_HOME=...
$ docker run -v $SPARK_HOME:/opt/spark spark-test-master
CONTAINER_IP=172.17.0.3
...
18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and
started at http://172.17.0.3:8080
18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for
submitting applications on port 6066
18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
```
* Slave
```
$ docker run -v $SPARK_HOME:/opt/spark spark-test-worker
spark://172.17.0.3:7077
CONTAINER_IP=172.17.0.4
...
18/01/11 06:51:54 INFO Worker: Successfully registered with master
spark://172.17.0.3:7077
```
After slave starts, master will show
```
18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4
cores, 1024.0 MB RAM
```
Author: Dongjoon Hyun <[email protected]>
Closes #20230 from dongjoon-hyun/SPARK-23038.
(cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d)
Signed-off-by: Felix Cheung <[email protected]>
commit 0d425c3362dc648d5c85b2b07af1a9df23b6d422
Author: Felix Cheung <felixcheung_m@...>
Date: 2018-01-14T10:43:10Z
[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text
## What changes were proposed in this pull request?
fix truncated doc text
## How was this patch tested?
manually
Author: Felix Cheung <[email protected]>
Closes #20263 from felixcheung/r23docfix.
(cherry picked from commit 66738d29c59871b29d26fc3756772b95ef536248)
Signed-off-by: hyukjinkwon <[email protected]>
commit 5fbbd94d509dbbcfa1fe940569049f72ff4a6e89
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-14T14:26:21Z
[SPARK-23021][SQL] AnalysisBarrier should override innerChildren to print
correct explain output
## What changes were proposed in this pull request?
`AnalysisBarrier` in the current master cuts off explain results for parsed
logical plans;
```
scala> Seq((1, 1)).toDF("a",
"b").groupBy("a").count().sample(0.1).explain(true)
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -7661439431999668039
+- AnalysisBarrier Aggregate [a#5], [a#5, count(1) AS count#14L]
```
To fix this, `AnalysisBarrier` needs to override `innerChildren` and this
pr changed the output to;
```
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -5086223488015741426
+- AnalysisBarrier
+- Aggregate [a#5], [a#5, count(1) AS count#14L]
+- Project [_1#2 AS a#5, _2#3 AS b#6]
+- LocalRelation [_1#2, _2#3]
```
## How was this patch tested?
Added tests in `DataFrameSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20247 from maropu/SPARK-23021-2.
(cherry picked from commit 990f05c80347c6eec2ee06823cff587c9ea90b49)
Signed-off-by: gatorsmile <[email protected]>
commit 9051e1a265dc0f1dc19fd27a0127ffa47f3ac245
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-14T14:32:35Z
[SPARK-23051][CORE] Fix for broken job description in Spark UI
## What changes were proposed in this pull request?
In 2.2, the Spark UI displayed the stage description if the job description
was not set. This functionality was broken: the UI showed no description in
this case. In addition, the code used jobName and jobDescription instead of
stageName and stageDescription when JobTableRowData was created.
In this PR, the logic producing values for the job rows was modified to find
the latest stage attempt for the job and use that as a fallback if the job
description was missing. StageName and stageDescription are now set using
values from the stage, and jobName/description is used only as a fallback.
## How was this patch tested?
Manual testing of the UI, using the code in the bug report.
Author: Sandor Murakozi <[email protected]>
Closes #20251 from smurakozi/SPARK-23051.
(cherry picked from commit 60eeecd7760aee6ce2fd207c83ae40054eadaf83)
Signed-off-by: Sean Owen <[email protected]>
commit 2879236b92b5712b7438b972404375bbf1993df8
Author: guoxiaolong <guo.xiaolong1@...>
Date: 2018-01-14T18:02:49Z
[SPARK-22999][SQL] 'show databases like' command can remove the like keyword
## What changes were proposed in this pull request?
The grammar is SHOW DATABASES (LIKE pattern = STRING)?; this change makes the
LIKE keyword optional.
For reference, the SHOW TABLES command already accepts both SHOW TABLES
'test*' and SHOW TABLES LIKE 'test*'.
Similarly, SHOW DATABASES 'test*' and SHOW DATABASES LIKE 'test*' can both be
used.
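A quick sketch of both forms from Scala (database names are placeholders; assumes an
active SparkSession named `spark`):
```scala
// With this change, both forms are accepted; the LIKE keyword is optional.
spark.sql("SHOW DATABASES LIKE 'test*'").show()
spark.sql("SHOW DATABASES 'test*'").show()
```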
## How was this patch tested?
Unit tests and manual tests.
Author: guoxiaolong <[email protected]>
Closes #20194 from guoxiaolongzte/SPARK-22999.
(cherry picked from commit 42a1a15d739890bdfbb367ef94198b19e98ffcb7)
Signed-off-by: gatorsmile <[email protected]>
commit 30574fd3716dbdf553cfd0f4d33164ab8fbccb77
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T02:55:21Z
[SPARK-23054][SQL] Fix incorrect results of casting UserDefinedType to
String
## What changes were proposed in this pull request?
This pr fixed the issue when casting `UserDefinedType`s into strings;
```
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0,
Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0] |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
```
The root cause is that `Cast` handles the input data as
`UserDefinedType.sqlType` (the underlying storage type), so we should pass
the data through `UserDefinedType.deserialize` and then call `toString`.
This PR modifies the result to:
```
+---------+
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+
```
## How was this patch tested?
Added tests in `UserDefinedTypeSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20246 from maropu/SPARK-23054.
(cherry picked from commit b98ffa4d6dabaf787177d3f14b200fc4b118c7ce)
Signed-off-by: Wenchen Fan <[email protected]>
commit 81b989903af0cdcb6c19e6e8e7bdbac455a2c281
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-15T04:06:56Z
[SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` should work for ORC
files
## What changes were proposed in this pull request?
When `spark.sql.files.ignoreCorruptFiles=true`, we should ignore corrupted
ORC files.
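A minimal sketch of the behavior being tested (the path is a placeholder; assumes an
active SparkSession named `spark`):
```scala
// With ignoreCorruptFiles enabled, corrupted ORC files in the input directory
// are skipped instead of failing the whole read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.orc("/path/to/orc/dir")
df.count()
```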
## How was this patch tested?
Pass the Jenkins with a newly added test case.
Author: Dongjoon Hyun <[email protected]>
Closes #20240 from dongjoon-hyun/SPARK-23049.
(cherry picked from commit 9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43)
Signed-off-by: Wenchen Fan <[email protected]>
commit 188999a3401357399d8d2b30f440d8b0b0795fc5
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T08:26:52Z
[SPARK-23023][SQL] Cast field data to strings in showString
## What changes were proposed in this pull request?
The current `Dataset.showString` prints rows through `RowEncoder`
deserializers, like this:
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect; the correct one is:
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this PR fixes the code in `showString` to cast field data to strings
before printing.
## How was this patch tested?
Added tests in `DataFrameSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20214 from maropu/SPARK-23023.
(cherry picked from commit b59808385cfe24ce768e5b3098b9034e64b99a5a)
Signed-off-by: Wenchen Fan <[email protected]>
commit 3491ca4fb5c2e3fecd727f7a31b8efbe74032bcc
Author: Yuming Wang <yumwang@...>
Date: 2018-01-15T13:49:34Z
[SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module
## What changes were proposed in this pull request?
Remove `MaxPermSize` for `sql` module
## How was this patch tested?
Manually tested.
Author: Yuming Wang <[email protected]>
Closes #20268 from wangyum/SPARK-19550-MaxPermSize.
(cherry picked from commit a38c887ac093d7cf343d807515147d87ca931ce7)
Signed-off-by: Sean Owen <[email protected]>
commit c6a3b9297f0246cfc02a57ec099ca23db90f343f
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-15T14:32:38Z
[SPARK-23070] Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
## What changes were proposed in this pull request?
Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 and add the
missing exclusions to `v23excludes` in `MimaExcludes`. No item can be
un-excluded in `v23excludes`.
## How was this patch tested?
The existing tests.
Author: gatorsmile <[email protected]>
Closes #20264 from gatorsmile/bump22.
(cherry picked from commit bd08a9e7af4137bddca638e627ad2ae531bce20f)
Signed-off-by: gatorsmile <[email protected]>
----