GitHub user zhangzg187 opened a pull request:
https://github.com/apache/spark/pull/20540
Branch 2.3
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.3
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20540.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20540
----
commit cd92913f345c8d932d3c651626c7f803e6abdcdb
Author: jerryshao <sshao@...>
Date: 2018-01-04T19:39:42Z
[SPARK-21475][CORE][2ND ATTEMPT] Change to use NIO's Files API for external
shuffle service
## What changes were proposed in this pull request?
This PR is the second attempt at #18684. NIO's Files API doesn't override the
`skip` method for `InputStream`, so it introduces a performance issue
(mentioned in #20119). But using `FileInputStream`/`FileOutputStream` also
brings in a memory issue
(https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful),
which is severe for a long-running external shuffle service. So this proposal
only fixes the external shuffle service related code.
## How was this patch tested?
Existing tests.
Author: jerryshao <[email protected]>
Closes #20144 from jerryshao/SPARK-21475-v2.
(cherry picked from commit 93f92c0ed7442a4382e97254307309977ff676f8)
Signed-off-by: Shixiong Zhu <[email protected]>
commit bc4bef472de0e99f74a80954d694c3d1744afe3a
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-04T22:19:00Z
[SPARK-22850][CORE] Ensure queued events are delivered to all event queues.
The code in LiveListenerBus was queueing events before start in the
queues themselves; so in situations like the following:
bus.post(someEvent)
bus.addToEventLogQueue(listener)
bus.start()
"someEvent" would not be delivered to "listener" if that was the first
listener in the queue, because the queue wouldn't exist when the
event was posted.
This change buffers the events before starting the bus in the bus itself,
so that they can be delivered to all registered queues when the bus is
started.
Also tweaked the unit tests to cover the behavior above.
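A minimal sketch of the buffering idea described above, with hypothetical class and method names rather than the actual LiveListenerBus internals: events posted before `start()` live in the bus itself and are replayed into every queue that exists at start time.
```scala
import scala.collection.mutable

// Illustrative only: events posted before start() are buffered in the bus, not in any
// particular queue, so queues registered after the post still receive them on start().
class ListenerBusSketch[E] {
  private val queues = mutable.Buffer.empty[mutable.Queue[E]]
  private val queuedEvents = mutable.Buffer.empty[E]
  private var started = false

  def addQueue(queue: mutable.Queue[E]): Unit = queues += queue

  def post(event: E): Unit =
    if (started) queues.foreach(_.enqueue(event)) else queuedEvents += event

  def start(): Unit = {
    started = true
    // Replay buffered events into every queue registered so far.
    queuedEvents.foreach(e => queues.foreach(_.enqueue(e)))
    queuedEvents.clear()
  }
}
```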
Author: Marcelo Vanzin <[email protected]>
Closes #20039 from vanzin/SPARK-22850.
(cherry picked from commit d2cddc88eac32f26b18ec26bb59e85c6f09a8c88)
Signed-off-by: Imran Rashid <[email protected]>
commit 2ab4012adda941ebd637bd248f65cefdf4aaf110
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-04T23:00:09Z
[SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
Author: Marcelo Vanzin <[email protected]>
Closes #20156 from vanzin/SPARK-22948.
(cherry picked from commit 95f9659abe8845f9f3f42fd7ababd79e55c52489)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 84707f0c6afa9c5417e271657ff930930f82213c
Author: Yinan Li <liyinan926@...>
Date: 2018-01-04T23:35:20Z
[SPARK-22953][K8S] Avoids adding duplicated secret volumes when
init-container is used
## What changes were proposed in this pull request?
User-specified secrets are mounted into both the main container and
init-container (when it is used) in a Spark driver/executor pod, using the
`MountSecretsBootstrap`. Because `MountSecretsBootstrap` always adds new secret
volumes for the secrets to the pod, the same secret volumes get added twice:
once when mounting the secrets to the main container, and again when
mounting the secrets to the init-container. This PR fixes the issue by
separating `MountSecretsBootstrap.mountSecrets` out into two methods:
`addSecretVolumes` for adding secret volumes to a pod and `mountSecrets` for
mounting secret volumes to a container, respectively. `addSecretVolumes` is
only called once for each pod, whereas `mountSecrets` is called individually
for the main container and the init-container (if it is used).
Ref: https://github.com/apache-spark-on-k8s/spark/issues/594.
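A rough sketch of the split, using simplified placeholder types instead of the real Kubernetes model classes, just to show which method is per-pod and which is per-container:
```scala
// Hypothetical, simplified types; the real code operates on the Kubernetes pod/container model.
case class Volume(name: String, secretName: String)
case class Container(name: String, mountPaths: Seq[String] = Nil)
case class Pod(volumes: Seq[Volume] = Nil)

class MountSecretsBootstrapSketch(secretNamesToMountPaths: Map[String, String]) {
  // Called once per pod: each secret volume is declared exactly once.
  def addSecretVolumes(pod: Pod): Pod =
    pod.copy(volumes = pod.volumes ++
      secretNamesToMountPaths.keys.map(name => Volume(s"$name-volume", name)))

  // Called once per container (main container and, if used, init-container):
  // only mounts volumes, never adds new ones to the pod.
  def mountSecrets(container: Container): Container =
    container.copy(mountPaths = container.mountPaths ++ secretNamesToMountPaths.values)
}
```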
## How was this patch tested?
Unit tested and manually tested.
vanzin This replaces https://github.com/apache/spark/pull/20148.
hex108 foxish kimoonkim
Author: Yinan Li <[email protected]>
Closes #20159 from liyinan926/master.
(cherry picked from commit e288fc87a027ec1e1a21401d1f151df20dbfecf3)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit ea9da6152af9223787cffd83d489741b4cc5aa34
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-05T00:34:56Z
[SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly.
- Make it possible to build images from a git clone.
- Make it easy to use minikube to test things.
Also fixed what seemed like a bug: the base image wasn't getting the tag
provided in the command line. Adding the tag allows users to use multiple
Spark builds in the same kubernetes cluster.
Tested by deploying images on minikube and running spark-submit from a dev
environment; also by building the images with different tags and verifying
"docker images" in minikube.
Author: Marcelo Vanzin <[email protected]>
Closes #20154 from vanzin/SPARK-22960.
(cherry picked from commit 0428368c2c5e135f99f62be20877bbbda43be310)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 158f7e6a93b5acf4ce05c97b575124fd599cf927
Author: Juliusz Sompolski <julek@...>
Date: 2018-01-05T02:16:34Z
[SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt
## What changes were proposed in this pull request?
A 32-bit Int was used for the row rank, which overflowed for dataframes with
more than 2B rows.
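A tiny illustration of the failure mode (the row count below is just an example): a 32-bit rank wraps around past ~2.1B rows, while a Long rank does not.
```scala
val rows = 3L * 1000 * 1000 * 1000   // 3 billion rows, more than Int.MaxValue
val rankAsInt: Int = rows.toInt       // overflows to a negative value
val rankAsLong: Long = rows           // correct
println(s"Int rank: $rankAsInt, Long rank: $rankAsLong")
```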
## How was this patch tested?
Added test, but ignored, as it takes 4 minutes.
Author: Juliusz Sompolski <[email protected]>
Closes #20152 from juliuszsompolski/SPARK-22957.
(cherry picked from commit df7fc3ef3899cadd252d2837092bebe3442d6523)
Signed-off-by: Wenchen Fan <[email protected]>
commit 145820bda140d1385c4dd802fa79a871e6bf98be
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-05T06:02:21Z
[SPARK-22825][SQL] Fix incorrect results of Casting Array to String
## What changes were proposed in this pull request?
This pr fixed the issue when casting arrays into strings;
```
scala> val df =
spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))
scala> df.write.saveAsTable("t")
scala> sql("SELECT cast(ids as String) FROM t").show(false)
+------------------------------------------------------------------+
|ids |
+------------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df|
+------------------------------------------------------------------+
```
This pr modified the result into;
```
+------------------------------+
|ids |
+------------------------------+
|[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
+------------------------------+
```
## How was this patch tested?
Added tests in `CastSuite` and `SQLQuerySuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20024 from maropu/SPARK-22825.
(cherry picked from commit 52fc5c17d9d784b846149771b398e741621c0b5c)
Signed-off-by: Wenchen Fan <[email protected]>
commit 5b524cc0cd5a82e4fb0681363b6641e40b37075d
Author: Bago Amirbekian <bago@...>
Date: 2018-01-05T06:45:15Z
[SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed
memory tradeoff for TrainValidationSplit
## What changes were proposed in this pull request?
Avoid holding all models in memory for `TrainValidationSplit`.
## How was this patch tested?
Existing tests.
Author: Bago Amirbekian <[email protected]>
Closes #20143 from MrBago/trainValidMemoryFix.
(cherry picked from commit cf0aa65576acbe0209c67f04c029058fd73555c1)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit f9dcdbcefb545ced3f5b457e1e88c88a8e180f9f
Author: Yinan Li <liyinan926@...>
Date: 2018-01-05T07:23:41Z
[SPARK-22757][K8S] Enable spark.jars and spark.files in KUBERNETES mode
## What changes were proposed in this pull request?
We missed enabling `spark.files` and `spark.jars` in
https://github.com/apache/spark/pull/19954. The result is that remote
dependencies specified through `spark.files` or `spark.jars` are not included
in the list of remote dependencies to be downloaded by the init-container. This
PR fixes it.
## How was this patch tested?
Manual tests.
vanzin This replaces https://github.com/apache/spark/pull/20157.
foxish
Author: Yinan Li <[email protected]>
Closes #20160 from liyinan926/SPARK-22757.
(cherry picked from commit 6cff7d19f6a905fe425bd6892fe7ca014c0e696b)
Signed-off-by: Felix Cheung <[email protected]>
commit fd4e30476894b7c37cc2ae6243a941f0bc90388d
Author: Adrian Ionescu <adrian@...>
Date: 2018-01-05T13:32:39Z
[SPARK-22961][REGRESSION] Constant columns should generate
QueryPlanConstraints
## What changes were proposed in this pull request?
#19201 introduced the following regression: given something like
`df.withColumn("c", lit(2))`, we're no longer picking up `c === 2` as a
constraint and inferring filters from it when joins are involved, which may
lead to noticeable performance degradation.
This patch re-enables this optimization by picking up Aliases of Literals
in Projection lists as constraints and making sure they're not treated as
aliased columns.
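Roughly the kind of query affected, assuming a spark-shell session (the names below are illustrative); with the constraint recovered, the optimized plan should again show an inferred `c = 2` filter on the other join side:
```scala
import org.apache.spark.sql.functions.lit

val left  = spark.range(100).withColumn("c", lit(2))
val right = spark.range(100).toDF("c")
// Check the optimized plan for an inferred `c = 2` filter pushed onto `right`.
left.join(right, "c").explain(true)
```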
## How was this patch tested?
Unit test was added.
Author: Adrian Ionescu <[email protected]>
Closes #20155 from adrian-ionescu/constant_constraints.
(cherry picked from commit 51c33bd0d402af9e0284c6cbc0111f926446bfba)
Signed-off-by: gatorsmile <[email protected]>
commit 0a30e93507ba784729a498943e7eeda1d6f19fbf
Author: Bruce Robbins <bersprockets@...>
Date: 2018-01-05T17:58:28Z
[SPARK-22940][SQL] HiveExternalCatalogVersionsSuite should succeed on
platforms that don't have wget
## What changes were proposed in this pull request?
Modified HiveExternalCatalogVersionsSuite.scala to use Utils.doFetchFile to
download different versions of Spark binaries rather than launching wget as an
external process.
On platforms that don't have wget installed, this suite fails with an error.
cloud-fan : would you like to check this change?
## How was this patch tested?
1) test-only of HiveExternalCatalogVersionsSuite on several platforms.
Tested bad mirror, read timeout, and redirects.
2) ./dev/run-tests
Author: Bruce Robbins <[email protected]>
Closes #20147 from bersprockets/SPARK-22940-alt.
(cherry picked from commit c0b7424ecacb56d3e7a18acc11ba3d5e7be57c43)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit d1f422c1c12c8095e8522d1051a6e0e406748a3a
Author: Joseph K. Bradley <joseph@...>
Date: 2018-01-05T19:51:25Z
[SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator
## What changes were proposed in this pull request?
Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion
in the original PR: https://github.com/apache/spark/pull/19527 or read below
for what this PR includes:
* configedCategorySize: I reverted this to return an Array. I realized the
original setup (which I had recommended in the original PR) caused the whole
model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in
the previous PR. I think it's simpler but am open to suggestions.
I also made some small style cleanups based on IntelliJ warnings.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <[email protected]>
Closes #20132 from jkbradley/viirya-SPARK-13030.
(cherry picked from commit 930b90a84871e2504b57ed50efa7b8bb52d3ba44)
Signed-off-by: Joseph K. Bradley <[email protected]>
commit 55afac4e7b4f655aa05c5bcaf7851bb1e7699dba
Author: Gera Shegalov <gera@...>
Date: 2018-01-06T01:25:28Z
[SPARK-22914][DEPLOY] Register history.ui.port
## What changes were proposed in this pull request?
Register spark.history.ui.port as a known spark conf to be used in
substitution expressions even if it's not set explicitly.
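A sketch of how such a conf entry is typically registered with Spark's internal `ConfigBuilder` API; the object name and doc string below are illustrative, not the exact code from the patch:
```scala
package org.apache.spark.deploy.history

import org.apache.spark.internal.config.ConfigBuilder

// Registering the entry makes spark.history.ui.port a known conf, so it can be
// resolved in substitution expressions even when it is not set explicitly.
private[spark] object HistoryConfigSketch {
  val HISTORY_UI_PORT = ConfigBuilder("spark.history.ui.port")
    .doc("Web UI port to bind the Spark History Server")
    .intConf
    .createWithDefault(18080)
}
```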
## How was this patch tested?
Added unit test to demonstrate the issue
Author: Gera Shegalov <[email protected]>
Author: Gera Shegalov <[email protected]>
Closes #20098 from gerashegalov/gera/register-SHS-port-conf.
(cherry picked from commit ea956833017fcbd8ed2288368bfa2e417a2251c5)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit bf853018cabcd3b3abf84bfe534d2981020b4a71
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-06T01:26:03Z
[SPARK-22937][SQL] SQL elt output binary for binary inputs
## What changes were proposed in this pull request?
This pr modified `elt` to output binary for binary inputs.
`elt` in the current master always outputs data as a string. But in some
databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary
(so the current behavior might be a small surprise).
This pr is related to #19977.
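A quick way to observe the new behavior from a spark-shell session (illustrative): when every input is binary, `elt` should now keep the binary type instead of coercing to string.
```scala
// Before this change the result column type is string; after it, binary (all inputs binary).
spark.sql("SELECT elt(1, cast('a' as binary), cast('b' as binary))").printSchema()
```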
## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20135 from maropu/SPARK-22937.
(cherry picked from commit e8af7e8aeca15a6107248f358d9514521ffdc6d3)
Signed-off-by: gatorsmile <[email protected]>
commit 3e3e9386ed95435a2d1817653d1402c102e380dc
Author: Yinan Li <liyinan926@...>
Date: 2018-01-06T01:29:27Z
[SPARK-22960][K8S] Revert use of ARG base_image in images
## What changes were proposed in this pull request?
This PR reverts the `ARG base_image` before `FROM` in the images of driver,
executor, and init-container, introduced in
https://github.com/apache/spark/pull/20154. The reason is Docker versions
before 17.06 do not support this use (`ARG` before `FROM`).
## How was this patch tested?
Tested manually.
vanzin foxish kimoonkim
Author: Yinan Li <[email protected]>
Closes #20170 from liyinan926/master.
(cherry picked from commit bf65cd3cda46d5480bfcd13110975c46ca631972)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 7236914e5e7aeb4eb919530b6edbad70256cca52
Author: Li Jin <ice.xelloss@...>
Date: 2018-01-06T08:11:20Z
[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for
non-deterministic cases
## What changes were proposed in this pull request?
Add tests for using non-deterministic UDFs in aggregate.
Update the pandas_udf docstring w.r.t. determinism.
## How was this patch tested?
test_nondeterministic_udf_in_aggregate
Author: Li Jin <[email protected]>
Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.
(cherry picked from commit f2dd8b923759e8771b0e5f59bfa7ae4ad7e6a339)
Signed-off-by: gatorsmile <[email protected]>
commit e6449e8167776e3921c286d75e8cdd30ee33d77a
Author: zuotingbing <zuo.tingbing9@...>
Date: 2018-01-06T10:07:45Z
[SPARK-22793][SQL] Memory leak in Spark Thrift Server
## What changes were proposed in this pull request?
1. Start HiveThriftServer2.
2. Connect to thriftserver through beeline.
3. Close the beeline.
4. Repeat steps 2 and 3 many times.
We found many directories under `hive.exec.local.scratchdir` and
`hive.exec.scratchdir` that are never dropped. As we know, the scratchdir is
added to deleteOnExit when it is created, which means the size of the
FileSystem `deleteOnExit` cache keeps increasing until the JVM terminates.
In addition, when we use `jmap -histo:live [PID]`
to print the sizes of objects in the HiveThriftServer2 process, we can see
that the counts of `org.apache.spark.sql.hive.client.HiveClientImpl` and
`org.apache.hadoop.hive.ql.session.SessionState` keep increasing even though we
closed all the beeline connections, which may cause a memory leak.
## How was this patch tested?
Manual tests.
This PR follows up https://github.com/apache/spark/pull/19989.
Author: zuotingbing <[email protected]>
Closes #20029 from zuotingbing/SPARK-22793.
(cherry picked from commit be9a804f2ef77a5044d3da7d9374976daf59fc16)
Signed-off-by: gatorsmile <[email protected]>
commit 0377755985897c850dcf3a938520e5c10de4040a
Author: fjh100456 <fu.jinhua6@...>
Date: 2018-01-06T10:19:57Z
[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in
'ParquetOptions', `parquet.compression` needs to be considered.
## What changes were proposed in this pull request?
Since Hive 1.1, Hive has allowed users to set the Parquet compression codec via
the table-level property `parquet.compression`. See the JIRA:
https://issues.apache.org/jira/browse/HIVE-7858. We already support
`orc.compression` for ORC, so for external users it is more straightforward to
support both. See the Stack Overflow question:
https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties
On the Spark side, our table-level compression conf `compression` was added by
#11464 in Spark 2.0.
We need to support both table-level confs. Users might also use the
session-level conf `spark.sql.parquet.compression.codec`. The priority rule is:
if a compression codec configuration is found through Hive or Parquet, the
precedence is `compression`, then `parquet.compression`, then
`spark.sql.parquet.compression.codec`. Acceptable values include: none,
uncompressed, snappy, gzip, lzo.
After this change, the rule for Parquet is consistent with ORC.
Changes:
1. Also acquire 'compressionCodecClassName' from `parquet.compression`, with
the precedence order `compression`, `parquet.compression`,
`spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". In
`ParquetOptions` we already treat "none" as equivalent to "uncompressed", but
it was not allowed as a configured value.
3. Change `compressionCode` to `compressionCodecClassName`.
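A minimal sketch of the precedence described above, not the actual `ParquetOptions` code: the table-level `compression` wins over the table-level `parquet.compression`, which wins over the session conf.
```scala
// Hypothetical helper, for illustration only.
def resolveParquetCodec(tableProperties: Map[String, String], sessionCodec: String): String =
  tableProperties.get("compression")
    .orElse(tableProperties.get("parquet.compression"))
    .getOrElse(sessionCodec)
    .toLowerCase

// e.g. resolveParquetCodec(Map("parquet.compression" -> "GZIP"), "snappy") == "gzip"
```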
## How was this patch tested?
Add test.
Author: fjh100456 <[email protected]>
Closes #20076 from fjh100456/ParquetOptionIssue.
(cherry picked from commit 7b78041423b6ee330def2336dfd1ff9ae8469c59)
Signed-off-by: gatorsmile <[email protected]>
commit b66700a5ed9a6e31433bec7961361c382a8b4162
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-06T15:08:26Z
[SPARK-22901][PYTHON][FOLLOWUP] Adds the doc for asNondeterministic for
wrapped UDF function
## What changes were proposed in this pull request?
This PR wraps the `asNondeterministic` attribute in the wrapped UDF
function to set the docstring properly.
```python
from pyspark.sql.functions import udf
help(udf(lambda x: x).asNondeterministic)
```
Before:
```
Help on function <lambda> in module pyspark.sql.udf:
<lambda> lambda
(END
```
After:
```
Help on function asNondeterministic in module pyspark.sql.udf:
asNondeterministic()
Updates UserDefinedFunction to nondeterministic.
.. versionadded:: 2.3
(END)
```
## How was this patch tested?
Manually tested and a simple test was added.
Author: hyukjinkwon <[email protected]>
Closes #20173 from HyukjinKwon/SPARK-22901-followup.
commit f9e7b0c8aa9334fa2e467b26516ac8e54a51dc63
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-06T16:19:21Z
[HOTFIX] Fix style checking failure
## What changes were proposed in this pull request?
This PR is to fix the style checking failure.
## How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #20175 from gatorsmile/stylefix.
(cherry picked from commit 9a7048b2889bd0fd66e68a0ce3e07e466315a051)
Signed-off-by: gatorsmile <[email protected]>
commit 285d342c406cf304931844665e56725ef1a848e0
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-07T05:42:01Z
[SPARK-22973][SQL] Fix incorrect results of Casting Map to String
## What changes were proposed in this pull request?
This pr fixed the issue when casting maps into strings;
```
scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
scala> sql("SELECT cast(a as String) FROM t").show(false)
+----------------------------------------------------------------+
|a |
+----------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
+----------------------------------------------------------------+
```
This pr modified the result into;
```
+----------------+
|a |
+----------------+
|[1 -> a, 2 -> b]|
+----------------+
```
## How was this patch tested?
Added tests in `CastSuite`.
Author: Takeshi Yamamuro <[email protected]>
Closes #20166 from maropu/SPARK-22973.
(cherry picked from commit 18e94149992618a2b4e6f0fd3b3f4594e1745224)
Signed-off-by: Wenchen Fan <[email protected]>
commit 7673e9c569821e20c56c2a9a875bcc025d36315f
Author: Josh Rosen <joshrosen@...>
Date: 2018-01-08T03:39:45Z
[SPARK-22985] Fix argument escaping bug in from_utc_timestamp /
to_utc_timestamp codegen
## What changes were proposed in this pull request?
This patch adds additional escaping in `from_utc_timestamp` /
`to_utc_timestamp` expression codegen in order to fix a bug where invalid
timezones that contain special characters could cause generated code to fail
to compile.
## How was this patch tested?
New regression tests in `DateExpressionsSuite`.
Author: Josh Rosen <[email protected]>
Closes #20182 from
JoshRosen/SPARK-22985-fix-utc-timezone-function-escaping-bugs.
(cherry picked from commit 71d65a32158a55285be197bec4e41fedc9225b94)
Signed-off-by: gatorsmile <[email protected]>
commit a1d3352d279d9eec6de23df83398b12454c6564e
Author: Guilherme Berger <gberger@...>
Date: 2018-01-08T05:32:05Z
[SPARK-22566][PYTHON] Better error message for `_merge_type` in Pandas to
Spark DF conversion
## What changes were proposed in this pull request?
It provides a better error message when doing
`spark_session.createDataFrame(pandas_df)` with no schema and an error occurs
in the schema inference due to incompatible types.
The Pandas column names are propagated down and the error message mentions
which column had the merging error.
https://issues.apache.org/jira/browse/SPARK-22566
## How was this patch tested?
Manually in the `./bin/pyspark` console, and with new tests:
`./python/run-tests`
<img width="873" alt="screen shot 2017-11-21 at 13 29 49"
src="https://user-images.githubusercontent.com/3977115/33080121-382274e0-cecf-11e7-808f-057a65bb7b00.png">
I state that the contribution is my original work and that I license the
work to the Apache Spark project under the project's open source license.
Author: Guilherme Berger <[email protected]>
Closes #19792 from gberger/master.
(cherry picked from commit 3e40eb3f1ffac3d2f49459a801e3ce171ed34091)
Signed-off-by: Takuya UESHIN <[email protected]>
commit 8bf24e9fea9e9e23f03caf8b32acb4a64f5b00e3
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-08T05:59:08Z
[SPARK-22979][PYTHON][SQL] Avoid per-record type dispatch in Python data
conversion (EvaluatePython.fromJava)
## What changes were proposed in this pull request?
Seems we can avoid type dispatch for each value when converting Java objects
(from Pyrolite) to Spark's internal data format, because we know the schema
ahead of time.
I manually performed the benchmark as below:
```scala
test("EvaluatePython.fromJava / EvaluatePython.makeFromJava") {
  val numRows = 1000 * 1000
  val numFields = 30

  val random = new Random(System.nanoTime())
  val types = Array(
    BooleanType, ByteType, FloatType, DoubleType, IntegerType, LongType, ShortType,
    DecimalType.ShortDecimal, DecimalType.IntDecimal, DecimalType.ByteDecimal,
    DecimalType.FloatDecimal, DecimalType.LongDecimal, new DecimalType(5, 2),
    new DecimalType(12, 2), new DecimalType(30, 10), CalendarIntervalType)
  val schema = RandomDataGenerator.randomSchema(random, numFields, types)
  val rows = mutable.ArrayBuffer.empty[Array[Any]]
  var i = 0
  while (i < numRows) {
    val row = RandomDataGenerator.randomRow(random, schema)
    rows += row.toSeq.toArray
    i += 1
  }

  val benchmark = new Benchmark(
    "EvaluatePython.fromJava / EvaluatePython.makeFromJava", numRows)

  benchmark.addCase("Before - EvaluatePython.fromJava", 3) { _ =>
    var i = 0
    while (i < numRows) {
      EvaluatePython.fromJava(rows(i), schema)
      i += 1
    }
  }

  benchmark.addCase("After - EvaluatePython.makeFromJava", 3) { _ =>
    val fromJava = EvaluatePython.makeFromJava(schema)
    var i = 0
    while (i < numRows) {
      fromJava(rows(i))
      i += 1
    }
  }

  benchmark.run()
}
```
```
EvaluatePython.fromJava / EvaluatePython.makeFromJava:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------
Before - EvaluatePython.fromJava                              1265 / 1346          0.8        1264.8       1.0X
After - EvaluatePython.makeFromJava                            571 /  649          1.8         570.8       2.2X
```
If the structure is nested, I think the advantage should be larger than
this.
## How was this patch tested?
Existing tests should cover this. Also, I manually checked if the values
from before / after are actually same via `assert` when performing the
benchmarks.
Author: hyukjinkwon <[email protected]>
Closes #20172 from HyukjinKwon/type-dispatch-python-eval.
(cherry picked from commit 8fdeb4b9946bd9be045abb919da2e531708b3bd4)
Signed-off-by: Wenchen Fan <[email protected]>
commit 6964dfe47b2090e542b26cd64e27420ec3eb1a3d
Author: Josh Rosen <joshrosen@...>
Date: 2018-01-08T08:04:03Z
[SPARK-22983] Don't push filters beneath aggregates with empty grouping
expressions
## What changes were proposed in this pull request?
The following SQL query should return zero rows, but in Spark it actually
returns one row:
```
SELECT 1 from (
  SELECT 1 AS z, MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```
The problem stems from the `PushDownPredicate` rule: when this rule
encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`,
it removes the original filter and adds a new filter onto Aggregate's child,
e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a
counterexample: because there is no explicit `GROUP BY`, we are implicitly
computing a global aggregate over the entire table so the original filter was
not acting like a `HAVING` clause filtering the number of groups: if we push
this filter then it fails to actually reduce the cardinality of the Aggregate
output, leading to the wrong answer.
In 2016 I fixed a similar problem involving invalid pushdowns of
data-independent filters (filters which reference no columns of the filtered
relation). There was additional discussion after my fix was merged which
pointed out that my patch was an incomplete fix (see #15289), but it looks like
I must have either misunderstood the comment or forgotten to follow up on the
additional points raised there.
This patch fixes the problem by choosing to never push down filters in
cases where there are no grouping expressions. Since there are no grouping
keys, the only columns are aggregate columns and we can't push filters defined
over aggregate results, so this change won't cause us to miss out on any
legitimate pushdown opportunities.
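The query from the description, runnable as a quick check in a spark-shell session; with the fix in place it should return zero rows:
```scala
val count = spark.sql("""
  SELECT 1 FROM (
    SELECT 1 AS z, MIN(a.x)
    FROM (SELECT 1 AS x) a
    WHERE false
  ) b
  WHERE b.z != b.z
""").count()
assert(count == 0)  // the global aggregate keeps its single row, but b.z != b.z filters it out
```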
## How was this patch tested?
New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.
Author: Josh Rosen <[email protected]>
Closes #20180 from
JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.
(cherry picked from commit 2c73d2a948bdde798aaf0f87c18846281deb05fd)
Signed-off-by: gatorsmile <[email protected]>
commit 4a45f0a532216736f8874417c5cbd7912ca13db5
Author: Wenchen Fan <wenchen@...>
Date: 2018-01-08T11:41:41Z
[SPARK-21865][SQL] simplify the distribution semantic of Spark SQL
## What changes were proposed in this pull request?
**The current shuffle planning logic**
1. Each operator specifies the distribution requirements for its children,
via the `Distribution` interface.
2. Each operator specifies its output partitioning, via the `Partitioning`
interface.
3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a
`Distribution`.
4. For each operator, check each child of it, add a shuffle node above the
child if the child partitioning can not satisfy the required distribution.
5. For each operator, check if its children's output partitionings are
compatible with each other, via the `Partitioning.compatibleWith`.
6. If the check in 5 failed, add a shuffle above each child.
7. try to eliminate the shuffles added in 6, via `Partitioning.guarantees`.
This design has a major problem with the definition of "compatible".
`Partitioning.compatibleWith` is not well defined; ideally a `Partitioning`
can't know whether it's compatible with another `Partitioning` without more
information from the operator. For example, for `t1 join t2 on t1.a = t2.b`,
`HashPartitioning(a, 10)` should be compatible with `HashPartitioning(b, 10)`
under this case, but the partitioning itself doesn't know it.
As a result, `Partitioning.compatibleWith` currently always returns false
except for literals, which makes it almost useless. This also means that if an
operator has distribution requirements for multiple children, Spark always adds
shuffle nodes to all the children (although some of them can be eliminated).
However, there is no guarantee that the children's output partitionings are
compatible with each other after adding these shuffles; we just assume that the
operator will only specify `ClusteredDistribution` for multiple children.
I think it's very hard to guarantee that children are co-partitioned for all
kinds of operators, and we cannot even give a clear definition of
co-partitioning between distributions like `ClusteredDistribution(a,b)` and
`ClusteredDistribution(c)`.
I think we should drop the "compatible" concept in the distribution model,
and let the operator achieve the co-partition requirement by special
distribution requirements.
**Proposed shuffle planning logic after this PR**
(The first 4 are same as before)
1. Each operator specifies the distribution requirements for its children,
via the `Distribution` interface.
2. Each operator specifies its output partitioning, via the `Partitioning`
interface.
3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a
`Distribution`.
4. For each operator, check each child of it, add a shuffle node above the
child if the child partitioning can not satisfy the required distribution.
5. For each operator, check if its children's output partitionings have the
same number of partitions.
6. If the check in 5 failed, pick the max number of partitions from the
children's output partitionings, and add a shuffle to each child whose number
of partitions doesn't equal the max.
The new distribution model is very simple: we only have one kind of
relationship, which is `Partitioning.satisfy`. For multiple children, Spark
only guarantees they have the same number of partitions, and it's the
operator's responsibility to leverage this guarantee to achieve more
complicated requirements. For example, non-broadcast joins can use the newly
added `HashPartitionedDistribution` to achieve co-partitioning.
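A toy model of steps 4-6 of the new planning logic, using hypothetical types rather than the real `SparkPlan`/`Partitioning`/`Distribution` classes:
```scala
// Each child records how many partitions it produces and whether its partitioning
// already satisfies the distribution required by the parent operator.
case class ChildPlan(numPartitions: Int, satisfiesRequired: Boolean)

def planShuffles(children: Seq[ChildPlan]): Seq[ChildPlan] = {
  // Step 4: shuffle any child whose partitioning cannot satisfy the required distribution.
  val satisfied = children.map { c =>
    if (c.satisfiesRequired) c
    else c.copy(satisfiesRequired = true) // a shuffle would be inserted here
  }
  // Steps 5-6: if children disagree on the number of partitions, pick the max and
  // reshuffle every child that does not already have that many partitions.
  val maxPartitions = satisfied.map(_.numPartitions).max
  satisfied.map { c =>
    if (c.numPartitions == maxPartitions) c
    else c.copy(numPartitions = maxPartitions) // another shuffle would be inserted here
  }
}
```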
## How was this patch tested?
existing tests.
Author: Wenchen Fan <[email protected]>
Closes #19080 from cloud-fan/exchange.
(cherry picked from commit eb45b52e826ea9cea48629760db35ef87f91fea0)
Signed-off-by: Wenchen Fan <[email protected]>
commit 06fd842e3a120fde1c137e4945bcb747fc71a322
Author: Xianjin YE <advancedxy@...>
Date: 2018-01-08T15:49:07Z
[SPARK-22952][CORE] Deprecate stageAttemptId in favour of stageAttemptNumber
## What changes were proposed in this pull request?
1. Deprecate attemptId in StageInfo and add `def attemptNumber() =
attemptId`
2. Replace usage of stageAttemptId with stageAttemptNumber
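Roughly the shape of the change, simplified for illustration: `attemptId` stays for compatibility but is deprecated, and `attemptNumber()` becomes the preferred accessor.
```scala
// Hypothetical, trimmed-down stand-in for StageInfo.
class StageInfoSketch(private val attempt: Int) {
  @deprecated("Use attemptNumber() instead", "2.3.0")
  def attemptId: Int = attempt

  def attemptNumber(): Int = attempt
}
```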
## How was this patch tested?
I manually checked the compiler warning info
Author: Xianjin YE <[email protected]>
Closes #20178 from advancedxy/SPARK-22952.
(cherry picked from commit 40b983c3b44b6771f07302ce87987fa4716b5ebf)
Signed-off-by: Wenchen Fan <[email protected]>
commit eecd83cb2d24907aba303095b052997471247500
Author: foxish <ramanathana@...>
Date: 2018-01-08T21:01:45Z
[SPARK-22992][K8S] Remove assumption of the DNS domain
## What changes were proposed in this pull request?
Remove the use of the FQDN to access the driver because it assumes that it's
set up in a DNS zone (`cluster.local`), which is common but not ubiquitous.
Note that we already access the in-cluster API server through
`kubernetes.default.svc`, so, by extension, this should work as well.
The alternative is to introduce DNS zones for both of those addresses.
## How was this patch tested?
Unit tests
cc vanzin liyinan926 mridulm mccheah
Author: foxish <[email protected]>
Closes #20187 from foxish/cluster.local.
(cherry picked from commit eed82a0b211352215316ec70dc48aefc013ad0b2)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 8032cf852fccd0ab8754f633affdc9ba8fc99e58
Author: xubo245 <601450868@...>
Date: 2018-01-09T02:15:01Z
[SPARK-22972] Couldn't find corresponding Hive SerDe for data source
provider org.apache.spark.sql.hive.orc
## What changes were proposed in this pull request?
Fix the warning: Couldn't find corresponding Hive SerDe for data source
provider org.apache.spark.sql.hive.orc.
## How was this patch tested?
```scala
test("SPARK-22972: hive orc source") {
  assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
    .equals(HiveSerDe.sourceToSerDe("orc")))
}
```
Author: xubo245 <[email protected]>
Closes #20165 from xubo245/HiveSerDe.
(cherry picked from commit 68ce792b5857f0291154f524ac651036db868bb9)
Signed-off-by: gatorsmile <[email protected]>
commit 850b9f39186665fd727737a98b29abe5236830db
Author: Wang Gengliang <ltnwgl@...>
Date: 2018-01-09T02:44:21Z
[SPARK-22990][CORE] Fix method isFairScheduler in JobsTab and StagesTab
## What changes were proposed in this pull request?
In the current implementation, the function `isFairScheduler` always returns
false, since it compares a String with a `SchedulingMode`.
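A simplified illustration of the mismatch, not the actual JobsTab/StagesTab code: comparing a String config value with a `SchedulingMode` enum value is never true, so one side has to be converted first.
```scala
object FairSchedulerCheckDemo {
  object SchedulingMode extends Enumeration { val FAIR, FIFO = Value }

  def main(args: Array[String]): Unit = {
    val configured: String = "FAIR"
    // The bug: a String compared with an Enumeration value is always false.
    println(configured == SchedulingMode.FAIR)                           // false
    // Converting the String first makes the comparison meaningful.
    println(SchedulingMode.withName(configured) == SchedulingMode.FAIR)  // true
  }
}
```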
Author: Wang Gengliang <[email protected]>
Closes #20186 from gengliangwang/isFairScheduler.
(cherry picked from commit 849043ce1d28a976659278d29368da0799329db8)
Signed-off-by: Wenchen Fan <[email protected]>
----