[GitHub] spark pull request #22301: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec...

fjh100456 Fri, 31 Aug 2018 01:40:54 -0700

GitHub user fjh100456 opened a pull request:

    https://github.com/apache/spark/pull/22301


    [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS

    What changes were proposed in this pull request?
    Since [20522](https://github.com/apache/spark/pull/20522) by @dongjoon 
resolved the underlying issue, the compressionCodec test can now support CTAS, 
so the CTAS scenario suggested by @gatorsmile in 
[20087](https://github.com/apache/spark/pull/20087#discussion_r162252598) 
should be enabled.
    
    How was this patch tested?
    Added a test.
    
    cc @gatorsmile @dongjoon-hyun 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fjh100456/spark CompressionCodecCommit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22301
    
----
commit 5244aafc2d7945c11c96398b8d5b752b45fd148c
Author: Xianjin YE <advancedxy@...>
Date:   2018-01-02T15:30:38Z

    [SPARK-22897][CORE] Expose stageAttemptId in TaskContext
    
    ## What changes were proposed in this pull request?
    Adds `stageAttemptId` to `TaskContext` and updates the corresponding 
construction code.
    
    ## How was this patch tested?
    Added a new test in TaskContextSuite, two cases are tested:
    1. Normal case without failure
    2. Exception case with resubmitted stages
    
    Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897)
    
    Author: Xianjin YE <[email protected]>
    
    Closes #20082 from advancedxy/SPARK-22897.
    
    (cherry picked from commit a6fc300e91273230e7134ac6db95ccb4436c6f8f)
    Signed-off-by: Wenchen Fan <[email protected]>

commit b96a2132413937c013e1099be3ec4bc420c947fd
Author: Juliusz Sompolski <julek@...>
Date:   2018-01-03T13:40:51Z

    [SPARK-22938] Assert that SQLConf.get is accessed only on the driver.
    
    ## What changes were proposed in this pull request?
    
    Assert that code does not access SQLConf.get on an executor.
    Such access can lead to hard-to-detect bugs, where the executor reads 
fallbackConf, falling back to default config values and ignoring potentially 
changed non-default configs.
    If a config is to be passed to executor code, it needs to be read on the 
driver and passed explicitly.
    
    ## How was this patch tested?
    
    Check in existing tests.
    
    Author: Juliusz Sompolski <[email protected]>
    
    Closes #20136 from juliuszsompolski/SPARK-22938.
    
    (cherry picked from commit 247a08939d58405aef39b2a4e7773aa45474ad12)
    Signed-off-by: Wenchen Fan <[email protected]>

commit a05e85ecb76091567a26a3a14ad0879b4728addc
Author: gatorsmile <gatorsmile@...>
Date:   2018-01-03T14:09:30Z

    [SPARK-22934][SQL] Make optional clauses order insensitive for CREATE TABLE 
SQL statement
    
    ## What changes were proposed in this pull request?
    Currently, our CREATE TABLE syntax requires the EXACT order of clauses. It 
is pretty hard to remember the exact order. Thus, this PR makes the optional 
clauses order-insensitive for the `CREATE TABLE` SQL statement.
    
    ```
    CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
        [(col_name1 col_type1 [COMMENT col_comment1], ...)]
        USING datasource
        [OPTIONS (key1=val1, key2=val2, ...)]
        [PARTITIONED BY (col_name1, col_name2, ...)]
        [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
        [LOCATION path]
        [COMMENT table_comment]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
        [AS select_statement]
    ```
    
    The proposal is to make the following clauses order insensitive.
    ```
        [OPTIONS (key1=val1, key2=val2, ...)]
        [PARTITIONED BY (col_name1, col_name2, ...)]
        [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
        [LOCATION path]
        [COMMENT table_comment]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    ```
    
    The same idea is also applicable to Create Hive Table.
    ```
    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
        [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
        [COMMENT table_comment]
        [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
        [ROW FORMAT row_format]
        [STORED AS file_format]
        [LOCATION path]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
        [AS select_statement]
    ```
    
    The proposal is to make the following clauses order insensitive.
    ```
        [COMMENT table_comment]
        [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
        [ROW FORMAT row_format]
        [STORED AS file_format]
        [LOCATION path]
        [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    ```
    
    ## How was this patch tested?
    Added test cases
    
    Author: gatorsmile <[email protected]>
    
    Closes #20133 from gatorsmile/createDataSourceTableDDL.
    
    (cherry picked from commit 1a87a1609c4d2c9027a2cf669ea3337b89f61fb6)
    Signed-off-by: gatorsmile <[email protected]>

commit b96248862589bae1ddcdb14ce4c802789a001306
Author: Wenchen Fan <wenchen@...>
Date:   2018-01-03T14:18:13Z

    [SPARK-20236][SQL] dynamic partition overwrite
    
    ## What changes were proposed in this pull request?
    
    When overwriting a partitioned table with dynamic partition columns, the 
behavior is different between data source and hive tables.
    
    data source table: delete all partition directories that match the static 
partition values provided in the insert statement.
    
    hive table: only delete partition directories that have data written into 
them.
    
    This PR adds a new config that lets users choose Hive's behavior.
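
The two behaviors described above can be modelled with a small sketch. It does not use Spark; the `overwrite` function, its `mode` argument, and the partition paths are hypothetical names chosen for illustration only.

```python
def overwrite(existing, static_prefix, written, mode):
    """Return the partition directories left after an INSERT OVERWRITE.

    existing:      set of partition paths currently in the table
    static_prefix: path prefix built from the static partition values
    written:       set of partition paths the new job writes
    mode:          "static" (data source behavior) or "dynamic" (Hive behavior)
    """
    if mode == "static":
        # Delete every partition that matches the static values, written or not.
        deleted = {p for p in existing if p.startswith(static_prefix)}
    elif mode == "dynamic":
        # Delete only the partitions that actually receive new data.
        deleted = existing & written
    else:
        raise ValueError(f"unknown mode: {mode}")
    return (existing - deleted) | written


existing = {"year=2018/month=1", "year=2018/month=2", "year=2017/month=12"}
written = {"year=2018/month=1"}

static_result = overwrite(existing, "year=2018/", written, "static")
dynamic_result = overwrite(existing, "year=2018/", written, "dynamic")
```

In this toy model the static mode drops `year=2018/month=2` even though nothing was written to it, while the dynamic mode leaves it in place.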
    
    ## How was this patch tested?
    
    new tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #18714 from cloud-fan/overwrite-partition.
    
    (cherry picked from commit a66fe36cee9363b01ee70e469f1c968f633c5713)
    Signed-off-by: gatorsmile <[email protected]>

commit 27c949d673e45fdbbae0f2c08969b9d51222dd8d
Author: gatorsmile <gatorsmile@...>
Date:   2018-01-02T01:19:18Z

    [SPARK-22932][SQL] Refactor AnalysisContext
    
    ## What changes were proposed in this pull request?
    Add a `reset` function to ensure the state in `AnalysisContext ` is 
per-query.
    
    ## How was this patch tested?
    The existing test cases
    
    Author: gatorsmile <[email protected]>
    
    Closes #20127 from gatorsmile/refactorAnalysisContext.

commit 79f7263daa5f83e2026fda9a8bbb1090a1333f80
Author: chetkhatri <ckhatrimanjal@...>
Date:   2018-01-03T17:31:32Z

    [SPARK-22896] Improvement in String interpolation
    
    ## What changes were proposed in this pull request?
    
    * String interpolation in the ml pipeline example has been corrected to 
follow the Scala standard.
    
    ## How was this patch tested?
    * manually tested.
    
    Author: chetkhatri <[email protected]>
    
    Closes #20070 from chetkhatri/mllib-chetan-contrib.
    
    (cherry picked from commit 9a2b65a3c0c36316aae0a53aa0f61c5044c2ceff)
    Signed-off-by: Sean Owen <[email protected]>

commit a51212b642f05f28447b80aa29f5482de2c27f58
Author: Wenchen Fan <wenchen@...>
Date:   2018-01-03T23:28:53Z

    [SPARK-20960][SQL] make ColumnVector public
    
    ## What changes were proposed in this pull request?
    
    Move `ColumnVector` and related classes to 
`org.apache.spark.sql.vectorized`, and improve the documentation.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #20116 from cloud-fan/column-vector.
    
    (cherry picked from commit b297029130735316e1ac1144dee44761a12bfba7)
    Signed-off-by: gatorsmile <[email protected]>

commit f51c8fde8bf08705bacf8a93b5dba685ebbcec17
Author: Wenchen Fan <wenchen@...>
Date:   2018-01-04T05:14:52Z

    [SPARK-22944][SQL] improve FoldablePropagation
    
    ## What changes were proposed in this pull request?
    
    `FoldablePropagation` is a little tricky as it needs to handle attributes 
that are mis-derived from children, e.g. outer join outputs. This rule does a 
kind of stoppable tree transform, skipping the rule when it hits a node 
that may have mis-derived attributes.
    
    Logically we should be able to apply this rule above the unsupported nodes 
by just treating the unsupported nodes as leaf nodes. This PR improves this 
rule so that it does not stop the tree transformation, but instead reduces the 
foldable expressions that we want to propagate.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #20139 from cloud-fan/foldable.
    
    (cherry picked from commit 7d045c5f00e2c7c67011830e2169a4e130c3ace8)
    Signed-off-by: gatorsmile <[email protected]>

commit 1860a43e9affb7619be0a5a1c786e264d09bc446
Author: Felix Cheung <felixcheung_m@...>
Date:   2018-01-04T05:43:14Z

    [SPARK-22933][SPARKR] R Structured Streaming API for withWatermark, 
trigger, partitionBy
    
    ## What changes were proposed in this pull request?
    
    R Structured Streaming API for withWatermark, trigger, partitionBy
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <[email protected]>
    
    Closes #20129 from felixcheung/rwater.
    
    (cherry picked from commit df95a908baf78800556636a76d58bba9b3dd943f)
    Signed-off-by: Felix Cheung <[email protected]>

commit a7cfd6beaf35f79a744047a4a09714ef1da60293
Author: Kent Yao <yaooqinn@...>
Date:   2018-01-04T11:10:10Z

    [SPARK-22950][SQL] Handle ChildFirstURLClassLoader's parent
    
    ## What changes were proposed in this pull request?
    
    ChildFirstURLClassLoader's parent is set to null, so we can't get jars from 
its parent. This causes a ClassNotFoundException during HiveClient 
initialization with builtin hive jars, where we should probably use the Spark 
context class loader instead.
    
    ## How was this patch tested?
    
    add new ut
    cc cloud-fan gatorsmile
    
    Author: Kent Yao <[email protected]>
    
    Closes #20145 from yaooqinn/SPARK-22950.
    
    (cherry picked from commit 9fa703e89318922393bae03c0db4575f4f4b4c56)
    Signed-off-by: Wenchen Fan <[email protected]>

commit eb99b8adecc050240ce9d5e0b92a20f018df465e
Author: Wenchen Fan <wenchen@...>
Date:   2018-01-04T11:17:22Z

    [SPARK-22945][SQL] add java UDF APIs in the functions object
    
    ## What changes were proposed in this pull request?
    
    Currently Scala users can use UDF like
    ```
    val foo = udf((i: Int) => Math.random() + i).asNondeterministic
    df.select(foo('a))
    ```
    Python users can also do it with similar APIs. However, Java users can't, 
so we should add Java UDF APIs in the functions object.
    
    ## How was this patch tested?
    
    new tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #20141 from cloud-fan/udf.
    
    (cherry picked from commit d5861aba9d80ca15ad3f22793b79822e470d6913)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 1f5e3540c7535ceaea66ebd5ee2f598e8b3ba1a5
Author: gatorsmile <gatorsmile@...>
Date:   2018-01-04T13:07:31Z

    [SPARK-22939][PYSPARK] Support Spark UDF in registerFunction
    
    ## What changes were proposed in this pull request?
    ```Python
    import random
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType
    random_udf = udf(lambda: int(random.random() * 100), 
IntegerType()).asNondeterministic()
    spark.catalog.registerFunction("random_udf", random_udf, StringType())
    spark.sql("SELECT random_udf()").collect()
    ```
    
    We will get the following error.
    ```
    Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
    py4j.Py4JException: Method __getnewargs__([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:274)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
    ```
    
    This PR is to support it.
    
    ## How was this patch tested?
    WIP
    
    Author: gatorsmile <[email protected]>
    
    Closes #20137 from gatorsmile/registerFunction.
    
    (cherry picked from commit 5aadbc929cb194e06dbd3bab054a161569289af5)
    Signed-off-by: gatorsmile <[email protected]>

commit bcfeef5a944d56af1a5106f5c07296ea2c262991
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-04T13:15:10Z

    [SPARK-22771][SQL] Add a missing return statement in 
Concat.checkInputDataTypes
    
    ## What changes were proposed in this pull request?
    This pr is a follow-up to fix a bug left in #19977.
    
    ## How was this patch tested?
    Added tests in `StringExpressionsSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #20149 from maropu/SPARK-22771-FOLLOWUP.
    
    (cherry picked from commit 6f68316e98fad72b171df422566e1fc9a7bbfcde)
    Signed-off-by: gatorsmile <[email protected]>

commit cd92913f345c8d932d3c651626c7f803e6abdcdb
Author: jerryshao <sshao@...>
Date:   2018-01-04T19:39:42Z

    [SPARK-21475][CORE][2ND ATTEMPT] Change to use NIO's Files API for external 
shuffle service
    
    ## What changes were proposed in this pull request?
    
    This PR is the second attempt at #18684. The `InputStream` from NIO's Files 
API doesn't override the `skip` method, which causes a performance issue 
(mentioned in #20119). But using `FileInputStream`/`FileOutputStream` also 
causes a memory issue 
(https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful),
 which is severe for a long-running external shuffle service. So this 
proposal only changes the code related to the external shuffle service.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: jerryshao <[email protected]>
    
    Closes #20144 from jerryshao/SPARK-21475-v2.
    
    (cherry picked from commit 93f92c0ed7442a4382e97254307309977ff676f8)
    Signed-off-by: Shixiong Zhu <[email protected]>

commit bc4bef472de0e99f74a80954d694c3d1744afe3a
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-01-04T22:19:00Z

    [SPARK-22850][CORE] Ensure queued events are delivered to all event queues.
    
    The code in LiveListenerBus was queueing events before start in the
    queues themselves; so in situations like the following:
    
       bus.post(someEvent)
       bus.addToEventLogQueue(listener)
       bus.start()
    
    "someEvent" would not be delivered to "listener" if that was the first
    listener in the queue, because the queue wouldn't exist when the
    event was posted.
    
    This change buffers the events before starting the bus in the bus itself,
    so that they can be delivered to all registered queues when the bus is
    started.
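
The buffering idea can be illustrated with a toy model. This is only a sketch and not Spark's `LiveListenerBus`: the `ToyBus` class and its method names are hypothetical stand-ins, and threading is left out entirely.

```python
class ToyBus:
    """Toy stand-in for an event bus; not Spark's LiveListenerBus."""

    def __init__(self):
        self.started = False
        self.pending = []   # events posted before start() are buffered here
        self.queues = {}    # queue name -> list of listener callbacks

    def add_to_queue(self, name, listener):
        self.queues.setdefault(name, []).append(listener)

    def post(self, event):
        if not self.started:
            # Buffer in the bus itself, so queues created later still see it.
            self.pending.append(event)
            return
        self._deliver(event)

    def start(self):
        self.started = True
        # Flush buffered events to every queue registered so far.
        for event in self.pending:
            self._deliver(event)
        self.pending.clear()

    def _deliver(self, event):
        for listeners in self.queues.values():
            for listener in listeners:
                listener(event)


received = []
bus = ToyBus()
bus.post("someEvent")                          # posted before any queue exists
bus.add_to_queue("eventLog", received.append)  # queue is created afterwards
bus.start()
```

Because the event waits in the bus rather than in a queue, the listener registered after `post` still receives it once `start` runs.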
    
    Also tweaked the unit tests to cover the behavior above.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #20039 from vanzin/SPARK-22850.
    
    (cherry picked from commit d2cddc88eac32f26b18ec26bb59e85c6f09a8c88)
    Signed-off-by: Imran Rashid <[email protected]>

commit 2ab4012adda941ebd637bd248f65cefdf4aaf110
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-01-04T23:00:09Z

    [SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #20156 from vanzin/SPARK-22948.
    
    (cherry picked from commit 95f9659abe8845f9f3f42fd7ababd79e55c52489)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 84707f0c6afa9c5417e271657ff930930f82213c
Author: Yinan Li <liyinan926@...>
Date:   2018-01-04T23:35:20Z

    [SPARK-22953][K8S] Avoids adding duplicated secret volumes when 
init-container is used
    
    ## What changes were proposed in this pull request?
    
    User-specified secrets are mounted into both the main container and 
init-container (when it is used) in a Spark driver/executor pod, using the 
`MountSecretsBootstrap`. Because `MountSecretsBootstrap` always adds new secret 
volumes for the secrets to the pod, the same secret volumes get added twice, 
once when mounting the secrets to the main container and once when 
mounting the secrets to the init-container. This PR fixes the issue by 
separating `MountSecretsBootstrap.mountSecrets` out into two methods: 
`addSecretVolumes` for adding secret volumes to a pod and `mountSecrets` for 
mounting secret volumes to a container, respectively. `addSecretVolumes` is 
only called once for each pod, whereas `mountSecrets` is called individually 
for the main container and the init-container (if it is used).
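
A rough sketch of the split, using plain dictionaries instead of Kubernetes objects. The snake_case functions below only mirror the roles of `addSecretVolumes` and `mountSecrets`; the data layout and the `-volume` naming are assumptions made for illustration.

```python
def add_secret_volumes(pod, secrets):
    """Add one volume per secret to the pod; meant to be called once per pod."""
    for name in secrets:
        pod["volumes"].append(f"{name}-volume")


def mount_secrets(container, secrets):
    """Mount the already-added secret volumes into a single container."""
    for name, path in secrets.items():
        container["mounts"].append((f"{name}-volume", path))


secrets = {"db-creds": "/etc/secrets/db"}   # hypothetical secret name and path
pod = {"volumes": []}
main_container = {"mounts": []}
init_container = {"mounts": []}

add_secret_volumes(pod, secrets)         # once for the pod
mount_secrets(main_container, secrets)   # once per container
mount_secrets(init_container, secrets)
```

With the volume step separated out, both containers mount the same volume while the pod lists it only once.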
    
    Ref: https://github.com/apache-spark-on-k8s/spark/issues/594.
    
    ## How was this patch tested?
    Unit tested and manually tested.
    
    vanzin This replaces https://github.com/apache/spark/pull/20148.
    hex108 foxish kimoonkim
    
    Author: Yinan Li <[email protected]>
    
    Closes #20159 from liyinan926/master.
    
    (cherry picked from commit e288fc87a027ec1e1a21401d1f151df20dbfecf3)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit ea9da6152af9223787cffd83d489741b4cc5aa34
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-01-05T00:34:56Z

    [SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly.
    
    - Make it possible to build images from a git clone.
    - Make it easy to use minikube to test things.
    
    Also fixed what seemed like a bug: the base image wasn't getting the tag
    provided in the command line. Adding the tag allows users to use multiple
    Spark builds in the same kubernetes cluster.
    
    Tested by deploying images on minikube and running spark-submit from a dev
    environment; also by building the images with different tags and verifying
    "docker images" in minikube.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #20154 from vanzin/SPARK-22960.
    
    (cherry picked from commit 0428368c2c5e135f99f62be20877bbbda43be310)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 158f7e6a93b5acf4ce05c97b575124fd599cf927
Author: Juliusz Sompolski <julek@...>
Date:   2018-01-05T02:16:34Z

    [SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt
    
    ## What changes were proposed in this pull request?
    
    32bit Int was used for row rank.
    That overflowed in a dataframe with more than 2B rows.
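
The overflow can be reproduced outside the JVM with a small simulation. Python integers do not overflow, so the hypothetical helper `to_int32` below wraps values to the signed 32-bit range by hand; it is a sketch of the arithmetic, not code from the patch.

```python
def to_int32(value):
    """Wrap an integer to the signed 32-bit range, as a JVM Int would."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


INT_MAX = 2**31 - 1                   # largest value a 32-bit signed Int can hold

rank_as_int = to_int32(INT_MAX + 1)   # one row past the limit wraps around
rank_as_long = INT_MAX + 1            # a wider counter keeps counting
```

A rank that wraps to a negative number can no longer be compared meaningfully against target ranks, which is why a wider type is needed once the row count passes `INT_MAX`.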
    
    ## How was this patch tested?
    
    Added test, but ignored, as it takes 4 minutes.
    
    Author: Juliusz Sompolski <[email protected]>
    
    Closes #20152 from juliuszsompolski/SPARK-22957.
    
    (cherry picked from commit df7fc3ef3899cadd252d2837092bebe3442d6523)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 145820bda140d1385c4dd802fa79a871e6bf98be
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-05T06:02:21Z

    [SPARK-22825][SQL] Fix incorrect results of Casting Array to String
    
    ## What changes were proposed in this pull request?
    This PR fixes an issue when casting arrays into strings:
    ```
    scala> val df = 
spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))
    scala> df.write.saveAsTable("t")
    scala> sql("SELECT cast(ids as String) FROM t").show(false)
    +------------------------------------------------------------------+
    |ids                                                               |
    +------------------------------------------------------------------+
    |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df|
    +------------------------------------------------------------------+
    ```
    
    This PR changes the result to:
    ```
    +------------------------------+
    |ids                           |
    +------------------------------+
    |[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
    +------------------------------+
    ```
    
    ## How was this patch tested?
    Added tests in `CastSuite` and `SQLQuerySuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #20024 from maropu/SPARK-22825.
    
    (cherry picked from commit 52fc5c17d9d784b846149771b398e741621c0b5c)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 5b524cc0cd5a82e4fb0681363b6641e40b37075d
Author: Bago Amirbekian <bago@...>
Date:   2018-01-05T06:45:15Z

    [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed 
memory tradeoff for TrainValidationSplit
    
    ## What changes were proposed in this pull request?
    
    Avoid holding all models in memory for `TrainValidationSplit`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Bago Amirbekian <[email protected]>
    
    Closes #20143 from MrBago/trainValidMemoryFix.
    
    (cherry picked from commit cf0aa65576acbe0209c67f04c029058fd73555c1)
    Signed-off-by: Joseph K. Bradley <[email protected]>

commit f9dcdbcefb545ced3f5b457e1e88c88a8e180f9f
Author: Yinan Li <liyinan926@...>
Date:   2018-01-05T07:23:41Z

    [SPARK-22757][K8S] Enable spark.jars and spark.files in KUBERNETES mode
    
    ## What changes were proposed in this pull request?
    
    We missed enabling `spark.files` and `spark.jars` in 
https://github.com/apache/spark/pull/19954. The result is that remote 
dependencies specified through `spark.files` or `spark.jars` are not included 
in the list of remote dependencies to be downloaded by the init-container. This 
PR fixes it.
    
    ## How was this patch tested?
    
    Manual tests.
    
    vanzin This replaces https://github.com/apache/spark/pull/20157.
    
    foxish
    
    Author: Yinan Li <[email protected]>
    
    Closes #20160 from liyinan926/SPARK-22757.
    
    (cherry picked from commit 6cff7d19f6a905fe425bd6892fe7ca014c0e696b)
    Signed-off-by: Felix Cheung <[email protected]>

commit fd4e30476894b7c37cc2ae6243a941f0bc90388d
Author: Adrian Ionescu <adrian@...>
Date:   2018-01-05T13:32:39Z

    [SPARK-22961][REGRESSION] Constant columns should generate 
QueryPlanConstraints
    
    ## What changes were proposed in this pull request?
    
    #19201 introduced the following regression: given something like 
`df.withColumn("c", lit(2))`, we no longer pick up `c === 2` as a 
constraint or infer filters from it when joins are involved, which may lead to 
noticeable performance degradation.
    
    This patch re-enables this optimization by picking up Aliases of Literals 
in Projection lists as constraints and making sure they're not treated as 
aliased columns.
    
    ## How was this patch tested?
    
    Unit test was added.
    
    Author: Adrian Ionescu <[email protected]>
    
    Closes #20155 from adrian-ionescu/constant_constraints.
    
    (cherry picked from commit 51c33bd0d402af9e0284c6cbc0111f926446bfba)
    Signed-off-by: gatorsmile <[email protected]>

commit 0a30e93507ba784729a498943e7eeda1d6f19fbf
Author: Bruce Robbins <bersprockets@...>
Date:   2018-01-05T17:58:28Z

    [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite should succeed on 
platforms that don't have wget
    
    ## What changes were proposed in this pull request?
    
    Modified HiveExternalCatalogVersionsSuite.scala to use Utils.doFetchFile to 
download different versions of Spark binaries rather than launching wget as an 
external process.
    
    On platforms that don't have wget installed, this suite fails with an error.
    
    cloud-fan : would you like to check this change?
    
    ## How was this patch tested?
    
    1) test-only of HiveExternalCatalogVersionsSuite on several platforms. 
Tested bad mirror, read timeout, and redirects.
    2) ./dev/run-tests
    
    Author: Bruce Robbins <[email protected]>
    
    Closes #20147 from bersprockets/SPARK-22940-alt.
    
    (cherry picked from commit c0b7424ecacb56d3e7a18acc11ba3d5e7be57c43)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit d1f422c1c12c8095e8522d1051a6e0e406748a3a
Author: Joseph K. Bradley <joseph@...>
Date:   2018-01-05T19:51:25Z

    [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator
    
    ## What changes were proposed in this pull request?
    
    Follow-up cleanups for the OneHotEncoderEstimator PR.  See some discussion 
in the original PR: https://github.com/apache/spark/pull/19527 or read below 
for what this PR includes:
    * configedCategorySize: I reverted this to return an Array.  I realized the 
original setup (which I had recommended in the original PR) caused the whole 
model to be serialized in the UDF.
    * encoder: I reorganized the logic to show what I meant in the comment in 
the previous PR.  I think it's simpler but am open to suggestions.
    
    I also made some small style cleanups based on IntelliJ warnings.
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #20132 from jkbradley/viirya-SPARK-13030.
    
    (cherry picked from commit 930b90a84871e2504b57ed50efa7b8bb52d3ba44)
    Signed-off-by: Joseph K. Bradley <[email protected]>

commit 55afac4e7b4f655aa05c5bcaf7851bb1e7699dba
Author: Gera Shegalov <gera@...>
Date:   2018-01-06T01:25:28Z

    [SPARK-22914][DEPLOY] Register history.ui.port
    
    ## What changes were proposed in this pull request?
    
    Register spark.history.ui.port as a known spark conf to be used in 
substitution expressions even if it's not set explicitly.
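
As a rough illustration of why registration matters, the sketch below expands `${key}` references from explicit settings first and registered defaults second. It is not Spark's config code: the `REGISTERED_DEFAULTS` table and the `substitute` function are hypothetical, and leaving unknown keys untouched is an assumption of this sketch. 18080 is the documented default of `spark.history.ui.port`.

```python
import re

# Defaults for "registered" keys; the registry itself is a hypothetical stand-in.
REGISTERED_DEFAULTS = {"spark.history.ui.port": "18080"}


def substitute(value, conf):
    """Expand ${key} references using explicit settings, then registered defaults."""
    def lookup(match):
        key = match.group(1)
        if key in conf:
            return conf[key]
        if key in REGISTERED_DEFAULTS:
            return REGISTERED_DEFAULTS[key]
        return match.group(0)   # unknown keys are left untouched (assumption)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)


template = "http://host:${spark.history.ui.port}/history"
with_default = substitute(template, {})
with_override = substitute(template, {"spark.history.ui.port": "18081"})
```

Without the registered default, the reference in `template` would stay unexpanded whenever the port is not set explicitly.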
    
    ## How was this patch tested?
    
    Added unit test to demonstrate the issue
    
    Author: Gera Shegalov <[email protected]>
    Author: Gera Shegalov <[email protected]>
    
    Closes #20098 from gerashegalov/gera/register-SHS-port-conf.
    
    (cherry picked from commit ea956833017fcbd8ed2288368bfa2e417a2251c5)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit bf853018cabcd3b3abf84bfe534d2981020b4a71
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-06T01:26:03Z

    [SPARK-22937][SQL] SQL elt output binary for binary inputs
    
    ## What changes were proposed in this pull request?
    This pr modified `elt` to output binary for binary inputs.
    `elt` in the current master always outputs data as a string. But in some 
databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary 
(this might come as a small surprise).
    This pr is related to #19977.
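
The intended typing rule can be sketched in plain Python, with `bytes` standing in for binary and `str` for string. The `elt` function below is a toy model rather than Spark's implementation; returning `None` for an out-of-range index is an assumption of this sketch.

```python
def elt(n, *inputs):
    """Return the n-th input (1-based), or None when n is out of range.

    The result is bytes only when every input is bytes; otherwise all
    inputs are converted to str first.
    """
    if all(isinstance(v, bytes) for v in inputs):
        values = inputs
    else:
        values = [v.decode("utf-8") if isinstance(v, bytes) else str(v)
                  for v in inputs]
    return values[n - 1] if 1 <= n <= len(values) else None


all_binary = elt(2, b"spark", b"sql")   # every input is binary
mixed = elt(2, "spark", b"sql")         # one string input forces string output
```

Only the all-binary call keeps the binary type; a single string input is enough to fall back to string output.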
    
    ## How was this patch tested?
    Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #20135 from maropu/SPARK-22937.
    
    (cherry picked from commit e8af7e8aeca15a6107248f358d9514521ffdc6d3)
    Signed-off-by: gatorsmile <[email protected]>

commit 3e3e9386ed95435a2d1817653d1402c102e380dc
Author: Yinan Li <liyinan926@...>
Date:   2018-01-06T01:29:27Z

    [SPARK-22960][K8S] Revert use of ARG base_image in images
    
    ## What changes were proposed in this pull request?
    
    This PR reverts the `ARG base_image` before `FROM` in the images of driver, 
executor, and init-container, introduced in 
https://github.com/apache/spark/pull/20154. The reason is Docker versions 
before 17.06 do not support this use (`ARG` before `FROM`).
    
    ## How was this patch tested?
    
    Tested manually.
    
    vanzin foxish kimoonkim
    
    Author: Yinan Li <[email protected]>
    
    Closes #20170 from liyinan926/master.
    
    (cherry picked from commit bf65cd3cda46d5480bfcd13110975c46ca631972)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 7236914e5e7aeb4eb919530b6edbad70256cca52
Author: Li Jin <ice.xelloss@...>
Date:   2018-01-06T08:11:20Z

    [SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for 
non-deterministic cases
    
    ## What changes were proposed in this pull request?
    
    Add tests for using non-deterministic UDFs in aggregate.
    
    Update pandas_udf docstring with respect to determinism.
    
    ## How was this patch tested?
    test_nondeterministic_udf_in_aggregate
    
    Author: Li Jin <[email protected]>
    
    Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.
    
    (cherry picked from commit f2dd8b923759e8771b0e5f59bfa7ae4ad7e6a339)
    Signed-off-by: gatorsmile <[email protected]>

commit e6449e8167776e3921c286d75e8cdd30ee33d77a
Author: zuotingbing <zuo.tingbing9@...>
Date:   2018-01-06T10:07:45Z

    [SPARK-22793][SQL] Memory leak in Spark Thrift Server
    
    # What changes were proposed in this pull request?
    1. Start HiveThriftServer2.
    2. Connect to thriftserver through beeline.
    3. Close the beeline.
    4. Repeat steps 2 and 3 many times.
    
    After these steps, many directories under the paths 
`hive.exec.local.scratchdir` and `hive.exec.scratchdir` are never dropped. Each 
scratchdir is added to `deleteOnExit` when it is created, so the FileSystem 
`deleteOnExit` cache keeps growing until the JVM terminates.
    
    In addition, using `jmap -histo:live [PID]` to print the sizes of objects 
in the HiveThriftServer2 process shows that 
`org.apache.spark.sql.hive.client.HiveClientImpl` and 
`org.apache.hadoop.hive.ql.session.SessionState` objects keep increasing even 
after all the beeline connections are closed, which may cause a memory leak.
    
    # How was this patch tested?
    manual tests
    
    This PR is a follow-up to https://github.com/apache/spark/pull/19989
    
    Author: zuotingbing <[email protected]>
    
    Closes #20029 from zuotingbing/SPARK-22793.
    
    (cherry picked from commit be9a804f2ef77a5044d3da7d9374976daf59fc16)
    Signed-off-by: gatorsmile <[email protected]>

----


