GitHub user zhuge134 opened a pull request:

    https://github.com/apache/spark/pull/20586

    Branch 2.1

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20586
    
----
commit 21afc4534f90e063330ad31033aa178b37ef8340
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-02-22T21:19:31Z

    [SPARK-19652][UI] Do auth checks for REST API access (branch-2.1).
    
    The REST API has a security filter that performs auth checks
    based on the UI root's security manager. That works fine when
    the UI root is the app's UI, but not when it's the history server.
    
    In the SHS case, all users would be allowed to see all applications
    through the REST API, even if the UI itself wouldn't be available
    to them.
    
    This change adds auth checks for each app access through the API
    too, so that only authorized users can see the app's data.
    
    The change also modifies the existing security filter to use
    `HttpServletRequest.getRemoteUser()`, which is used in other
    places. That is not necessarily the same as the principal's
    name; for example, when using Hadoop's SPNEGO auth filter,
    the remote user has the realm information stripped, which then matches
    the user name registered as the owner of the application.
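    
    As a rough illustration of the per-app check described above (a sketch of Spark-internal code, not the actual patch), assuming a `SecurityManager` that carries the application's view ACLs:
    
    ```scala
    // Hedged sketch: deny REST API access to an app unless the remote user
    // passes the UI view-permission check.
    import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
    import org.apache.spark.SecurityManager
    
    def checkAppAccess(
        req: HttpServletRequest,
        res: HttpServletResponse,
        securityManager: SecurityManager): Boolean = {
      // Use the remote user (realm already stripped by e.g. a SPNEGO filter),
      // not the raw principal name.
      val user = req.getRemoteUser
      if (securityManager.checkUIViewPermissions(user)) {
        true
      } else {
        res.sendError(HttpServletResponse.SC_FORBIDDEN,
          s"User $user is not authorized to view this application.")
        false
      }
    }
    ```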
    
    I also renamed the UIRootFromServletContext trait to a more generic
    name since I'm using it to store more context information now.
    
    Tested manually with an authentication filter enabled.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #17019 from vanzin/SPARK-19652_2.1.

commit d30238f1b9096c9fd85527d95be639de9388fcc7
Author: actuaryzhang <actuaryzhang10@...>
Date:   2017-02-23T19:12:02Z

    [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index
    
    ## What changes were proposed in this pull request?
    The `[[` method is supposed to take a single index and return a column. This is different from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is very likely given the behavior in base R).
    
    Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.
    
    ## How was this patch tested?
    new tests
    
    Author: actuaryzhang <actuaryzhan...@gmail.com>
    
    Closes #17017 from actuaryzhang/sparkRSubsetter.
    
    (cherry picked from commit 7bf09433f5c5e08154ba106be21fe24f17cd282b)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit 43084b3cc3918b720fe28053d2037fa22a71264e
Author: Herman van Hovell <hvanhovell@...>
Date:   2017-02-23T22:58:02Z

    [SPARK-19459][SQL][BRANCH-2.1] Support for nested char/varchar fields in ORC
    
    ## What changes were proposed in this pull request?
    This is a backport of the two following commits: https://github.com/apache/spark/commit/78eae7e67fd5dec0c2d5b18000053ce86cd0f1ae & https://github.com/apache/spark/commit/de8a03e68202647555e30fffba551f65bc77608d
    
    This PR adds support for ORC tables with (nested) char/varchar fields.
    
    ## How was this patch tested?
    Added a regression test to `OrcSourceSuite`.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #17041 from hvanhovell/SPARK-19459-branch-2.1.

commit 66a7ca28a9de92e67ce24896a851a0c96c92aec6
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2017-02-24T09:54:00Z

    [SPARK-19691][SQL][BRANCH-2.1] Fix ClassCastException when calculating percentile of decimal column
    
    ## What changes were proposed in this pull request?
    This is a backport of the following commit: https://github.com/apache/spark/commit/93aa4271596a30752dc5234d869c3ae2f6e8e723
    
    This PR fixes the class-cast exception below:
    ```
    scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
     java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
        at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
        at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
        at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
        at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
        at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
        at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
        at
    ```
    This fix simply converts Catalyst values (i.e., `Decimal`) into Scala ones by using `CatalystTypeConverters`.
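    
    A minimal sketch of that conversion idea (not the actual `Percentile` change; the data type here is illustrative):
    
    ```scala
    import org.apache.spark.sql.catalyst.CatalystTypeConverters
    import org.apache.spark.sql.types.{Decimal, DecimalType}
    
    // Build a converter for the column's data type, then turn the Catalyst
    // Decimal into an external value before doing numeric arithmetic on it.
    val toScala = CatalystTypeConverters.createToScalaConverter(DecimalType(10, 0))
    val external = toScala(Decimal(5))  // no longer a Catalyst Decimal
    ```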
    
    ## How was this patch tested?
    Added a test in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #17046 from maropu/SPARK-19691-BACKPORT2.1.

commit 6da6a27f673f6e45fe619e0411fbaaa14ea34bfb
Author: jerryshao <sshao@...>
Date:   2017-02-24T17:28:59Z

    [SPARK-19707][CORE] Improve the invalid path check for sc.addJar
    
    ## What changes were proposed in this pull request?
    
    Currently in Spark there are two issues when we add jars with an invalid path:
    
    * If the jar path is an empty string (e.g. `--jar ",dummy.jar"`), Spark will resolve it to the current directory path and add it to the classpath / file server, which is unwanted. This happens when submitting a Spark application programmatically. Spark should defensively filter out such empty paths.
    * If the jar path is invalid (the file doesn't exist), `addJar` doesn't check it and will still add it to the file server; the exception is then delayed until the job runs. This local path could be checked beforehand, with no need to wait until the task runs. We have a similar check in `addFile`, but lack a similar mechanism in `addJar` (a rough sketch of such a check follows below).
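    
    A rough sketch of the kind of defensive check described above (names are illustrative, not the actual `SparkContext.addJar` code):
    
    ```scala
    import java.io.File
    
    def validateJarPath(path: String): Option[String] = {
      val trimmed = path.trim
      if (trimmed.isEmpty) {
        None  // silently drop empty entries instead of resolving to the cwd
      } else if (trimmed.startsWith("file:") || !trimmed.contains(":")) {
        // Local path: verify the file exists before shipping it to the file server.
        val f = new File(trimmed.stripPrefix("file:"))
        if (f.isFile) Some(trimmed)
        else throw new IllegalArgumentException(s"Jar not found: $trimmed")
      } else {
        Some(trimmed)  // remote URI; existence is checked lazily
      }
    }
    ```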
    
    ## How was this patch tested?
    
    Add unit test and local manual verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17038 from jerryshao/SPARK-19707.
    
    (cherry picked from commit b0a8c16fecd4337f77bfbe4b45884254d7af52bd)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit ed9aaa3147553b737b852995ece67d1121467d0c
Author: jerryshao <sshao@...>
Date:   2017-02-24T17:31:52Z

    [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-client
    
    ## What changes were proposed in this pull request?
    
    Because yarn#client resets the `spark.yarn.keytab` configuration to point to the keytab's location in the distributed cache, if the user still uses the old `SparkConf` to create a `SparkSession` with Hive enabled, it will read the keytab from the distributed-cache path. This is OK for yarn cluster mode, but in yarn client mode, where the driver runs outside the container, fetching the keytab will fail.
    
    So we should avoid resetting this configuration in `yarn#client` and only overwrite it for the AM, so that `spark.yarn.keytab` resolves to the correct keytab path whether running in client mode (keytab on the local fs) or cluster mode (keytab in the distributed cache).
    
    ## How was this patch tested?
    
    Verified in a secure cluster.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #16923 from jerryshao/SPARK-19038.
    
    (cherry picked from commit a920a4369434c84274866a09f61e402232c3b47c)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit 97866e198afe07824d041293849d9302e734d58f
Author: Boaz Mohar <boazmohar@...>
Date:   2017-02-25T19:32:09Z

    [MINOR][DOCS] Fixes two problems in the SQL programming guide page
    
    ## What changes were proposed in this pull request?
    
    Removed duplicated lines in the SQL Python example and fixed a typo.
    
    ## How was this patch tested?
    
    Searched for other typos in the page to minimize PRs.
    
    Author: Boaz Mohar <boazmo...@gmail.com>
    
    Closes #17066 from boazmohar/doc-fix.
    
    (cherry picked from commit 061bcfb869fe5f64edd9ee2352fecd70665da317)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 20a432951c6281bb6d6bf9252ad5a352fef00424
Author: Bryan Cutler <cutlerb@...>
Date:   2017-02-26T04:03:27Z

    [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala 
implementation
    
    ## What changes were proposed in this pull request?
    Fixed the PySpark Params.copy method to behave like the Scala 
implementation.  The main issue was that it did not account for the 
_defaultParamMap and merged it into the explicitly created param map.
    
    ## How was this patch tested?
    Added new unit test to verify the copy method behaves correctly for copying 
uid, explicitly created params, and default params.
    
    Author: Bryan Cutler <cutl...@gmail.com>
    
    Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.

commit 04fbb9e0986ffdf61813eff7f0c36b1f0766f6df
Author: Eyal Zituny <eyal.zituny@...>
Date:   2017-02-26T23:57:32Z

    [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists
    
    ## What changes were proposed in this pull request?
    
    Currently, if multiple streaming query listeners exist, when a QueryTerminatedEvent is triggered only one of the listeners is invoked while the rest ignore the event.
    This is caused by the streaming query listener bus holding a set of running query ids: when a termination event is triggered, the terminated query id is removed from the set right after the first listener handles the event.
    In this PR, the query id is removed from the set only after all the listeners have handled the event.
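    
    For illustration, a small sketch of the scenario (listener bodies and the `spark` session are placeholders): with the fix, both registered listeners receive the QueryTerminatedEvent.
    
    ```scala
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._
    
    class LoggingListener(name: String) extends StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"$name saw termination of query ${event.id}")
    }
    
    spark.streams.addListener(new LoggingListener("listener-1"))
    spark.streams.addListener(new LoggingListener("listener-2"))
    ```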
    
    ## How was this patch tested?
    
    A test with multiple listeners has been added to `StreamingQueryListenerSuite`.
    
    Author: Eyal Zituny <eyal.zit...@equalum.io>
    
    Closes #16991 from eyalzit/master.
    
    (cherry picked from commit 9f8e392159ba65decddf62eb3cd85b6821db01b4)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 4b4c3bf3f78635d53ff983eabe37a4032947b499
Author: windpiger <songjun@...>
Date:   2017-02-28T08:16:49Z

    [SPARK-19748][SQL] refresh function has a wrong order to do cache invalidation and regenerate the in-memory vars for InMemoryFileIndex with FileStatusCache
    
    ## What changes were proposed in this pull request?
    
    If we refresh an `InMemoryFileIndex` backed by a `FileStatusCache`, it first uses the `FileStatusCache` to regenerate `cachedLeafFiles` etc., and only then calls `FileStatusCache.invalidateAll`.
    
    The order of these two actions is wrong, so the refresh has no effect.
    
    ```
      override def refresh(): Unit = {
        refresh0()
        fileStatusCache.invalidateAll()
      }
    
      private def refresh0(): Unit = {
        val files = listLeafFiles(rootPaths)
        cachedLeafFiles =
          new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
        cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
        cachedPartitionSpec = null
      }
    ```
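    
    A sketch of the corrected ordering (the actual patch may differ in details): invalidate the cache first, then rebuild the in-memory state so `listLeafFiles` sees fresh file statuses.
    
    ```scala
    override def refresh(): Unit = {
      fileStatusCache.invalidateAll()
      refresh0()
    }
    ```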
    ## How was this patch tested?
    unit test added
    
    Author: windpiger <song...@outlook.com>
    
    Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.
    
    (cherry picked from commit a350bc16d36c58b48ac01f0258678ffcdb77e793)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 947c0cd901a75e110ea3c1767a54a22b8d033972
Author: Roberto Agostino Vitillo <ra.vitillo@...>
Date:   2017-02-28T18:49:07Z

    [SPARK-19677][SS] Committing a delta file atop an existing one should not fail on HDFS
    
    ## What changes were proposed in this pull request?
    
    HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies:
    
    > Destination exists and is a file
    > Renaming a file atop an existing file is specified as failing, raising an exception.
    >    - Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
    >    - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.
    
    This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.
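    
    A sketch of the guard described above using the Hadoop FileSystem API (not the exact HDFSBackedStateStoreProvider code; names are illustrative):
    
    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    def commitDelta(fs: FileSystem, tmp: Path, finalDelta: Path): Unit = {
      if (fs.exists(finalDelta)) {
        // The batch was already committed (e.g. by a previous run); skip the
        // rename, which would silently return false on HDFS.
        fs.delete(tmp, false)
      } else if (!fs.rename(tmp, finalDelta)) {
        throw new java.io.IOException(s"Failed to rename $tmp to $finalDelta")
      }
    }
    ```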
    
    ## How was this patch tested?
    
    This patch was tested by running `StateStoreSuite`.
    
    Author: Roberto Agostino Vitillo <ra.viti...@gmail.com>
    
    Closes #17012 from vitillo/fix_rename.
    
    (cherry picked from commit 9734a928a75d29ea202e9f309f92ca4637d35671)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit d887f758152be4d6e089066a97b1eab817d3be83
Author: Michael McCune <msm@...>
Date:   2017-02-28T23:07:16Z

    [SPARK-19769][DOCS] Update quickstart instructions
    
    ## What changes were proposed in this pull request?
    
    This change addresses the renaming of the `simple.sbt` build file to
    `build.sbt`. Newer versions of the sbt tool do not find the file under the
    older name and look for `build.sbt` instead. The quickstart instructions
    for self-contained applications are updated with this change.
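    
    For reference, a minimal `build.sbt` along the lines of the quickstart (the version numbers here are illustrative; use the ones from the docs):
    
    ```scala
    // build.sbt
    name := "Simple Project"
    version := "1.0"
    scalaVersion := "2.11.8"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
    ```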
    
    ## How was this patch tested?
    
    As this is a relatively minor change of a few words, the markdown was 
checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` 
for testing purposes.
    
    Author: Michael McCune <m...@redhat.com>
    
    Closes #17101 from elmiko/spark-19769.
    
    (cherry picked from commit bf5987cbe6c9f4a1a91d912ed3a9098111632d1a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit f719cccdc46247d7d86a99a1eb177522d4a657ae
Author: Jeff Zhang <zjffdu@...>
Date:   2017-03-01T06:21:29Z

    [SPARK-19572][SPARKR] Allow to disable hive in sparkR shell
    
    ## What changes were proposed in this pull request?
    SPARK-15236 did this for the Scala shell; this ticket is for the SparkR shell. This is not only for SparkR itself, but can also benefit downstream projects like Livy, which uses shell.R for its interactive sessions. For now, Livy has no control over whether Hive is enabled or not.
    
    ## How was this patch tested?
    
    Tested manually: ran `bin/sparkR --master local --conf spark.sql.catalogImplementation=in-memory` and verified that Hive is not enabled.
    
    Author: Jeff Zhang <zjf...@apache.org>
    
    Closes #16907 from zjffdu/SPARK-19572.
    
    (cherry picked from commit 7315880568fd07d4dfb9f76d538f220e9d320c6f)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit bbe0d8caa88cfe5e3cde80b85898a198d785370d
Author: Stan Zhai <zhaishidan@...>
Date:   2017-03-01T15:52:35Z

    [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule
    
    ## What changes were proposed in this pull request?
    This PR fixes the code in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the `FoldablePropagation` rule.
    
    For the following query:
    
    ```
    val sqlA =
      """
        |create temporary view ta as
        |select a, 'a' as tag from t1 union all
        |select a, 'b' as tag from t2
      """.stripMargin
    
    val sqlB =
      """
        |create temporary view tb as
        |select a, 'a' as tag from t3 union all
        |select a, 'b' as tag from t4
      """.stripMargin
    
    val sql =
      """
        |select tb.* from ta inner join tb on
        |ta.a = tb.a and
        |ta.tag = tb.tag
      """.stripMargin
    ```
    
    The tag column is a constant alias column; it's folded by `FoldablePropagation` like this:
    
    ```
    TRACE SparkOptimizer:
    === Applying Rule 
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
     Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = 
a#4) && (a = a))
        :- Union                                           :- Union
        :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, 
a AS tag#8]
        :  :  +- LocalRelation [a#0]                       :  :  +- 
LocalRelation [a#0]
        :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, 
b AS tag#9]
        :     +- LocalRelation [a#2]                       :     +- 
LocalRelation [a#2]
        +- Union                                           +- Union
           :- Project [a#4, a AS tag#14]                      :- Project [a#4, 
a AS tag#14]
           :  +- LocalRelation [a#4]                          :  +- 
LocalRelation [a#4]
           +- Project [a#6, b AS tag#15]                      +- Project [a#6, 
b AS tag#15]
              +- LocalRelation [a#6]                             +- 
LocalRelation [a#6]
    ```
    
    Finally the Result of Batch Operator Optimizations is:
    
    ```
    Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = 
a#4)
    !   :- SubqueryAlias ta, `ta`                          :- Union
    !   :  +- Union                                        :  :- LocalRelation 
[a#0]
    !   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation 
[a#2]
    !   :     :  +- SubqueryAlias t1, `t1`                 +- Union
    !   :     :     +- Project [a#0]                          :- LocalRelation 
[a#4, tag#14]
    !   :     :        +- SubqueryAlias grouping              +- LocalRelation 
[a#6, tag#15]
    !   :     :           +- LocalRelation [a#0]
    !   :     +- Project [a#2, b AS tag#9]
    !   :        +- SubqueryAlias t2, `t2`
    !   :           +- Project [a#2]
    !   :              +- SubqueryAlias grouping
    !   :                 +- LocalRelation [a#2]
    !   +- SubqueryAlias tb, `tb`
    !      +- Union
    !         :- Project [a#4, a AS tag#14]
    !         :  +- SubqueryAlias t3, `t3`
    !         :     +- Project [a#4]
    !         :        +- SubqueryAlias grouping
    !         :           +- LocalRelation [a#4]
    !         +- Project [a#6, b AS tag#15]
    !            +- SubqueryAlias t4, `t4`
    !               +- Project [a#6]
    !                  +- SubqueryAlias grouping
    !                     +- LocalRelation [a#6]
    ```
    
    The condition `tag#8 = tag#14` of the INNER JOIN has been removed. This leads to wrong results for the inner join.
    
    After fix:
    
    ```
    === Result of Batch LocalRelation ===
     GlobalLimit 21                                           GlobalLimit 21
     +- LocalLimit 21                                         +- LocalLimit 21
        +- Project [a#4, tag#11]                                 +- Project 
[a#4, tag#11]
           +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join 
Inner, ((a#0 = a#4) && (tag#8 = tag#11))
    !         :- SubqueryAlias ta                                      :- Union
    !         :  +- Union                                              :  :- 
LocalRelation [a#0, tag#8]
    !         :     :- Project [a#0, a AS tag#8]                       :  +- 
LocalRelation [a#2, tag#9]
    !         :     :  +- SubqueryAlias t1                             +- Union
    !         :     :     +- Project [a#0]                                :- 
LocalRelation [a#4, tag#11]
    !         :     :        +- SubqueryAlias grouping                    +- 
LocalRelation [a#6, tag#12]
    !         :     :           +- LocalRelation [a#0]
    !         :     +- Project [a#2, b AS tag#9]
    !         :        +- SubqueryAlias t2
    !         :           +- Project [a#2]
    !         :              +- SubqueryAlias grouping
    !         :                 +- LocalRelation [a#2]
    !         +- SubqueryAlias tb
    !            +- Union
    !               :- Project [a#4, a AS tag#11]
    !               :  +- SubqueryAlias t3
    !               :     +- Project [a#4]
    !               :        +- SubqueryAlias grouping
    !               :           +- LocalRelation [a#4]
    !               +- Project [a#6, b AS tag#12]
    !                  +- SubqueryAlias t4
    !                     +- Project [a#6]
    !                        +- SubqueryAlias grouping
    !                           +- LocalRelation [a#6]
    ```
    
    ## How was this patch tested?
    
    Added sql-tests/inputs/inner-join.sql.
    All tests passed.
    
    Author: Stan Zhai <zhaishi...@haizhi.com>
    
    Closes #17099 from stanzhai/fix-inner-join.
    
    (cherry picked from commit 5502a9cf883b2058209904c152e5d2c2a106b072)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 27347b5f26f668783d8ded89149a5e761b67f786
Author: Michael Gummelt <mgummelt@...>
Date:   2017-03-01T23:32:32Z

    [SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio on registered cores rather than accepted cores
    
    See JIRA
    
    Unit tests, Mesos/Spark integration tests
    
    cc skonto susanxhuynh
    
    Author: Michael Gummelt <mgummeltmesosphere.io>
    
    Closes #17045 from mgummelt/SPARK-19373-registered-resources.
    
    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Michael Gummelt <mgumm...@mesosphere.io>
    
    Closes #17129 from mgummelt/SPARK-19373-registered-resources-2.1.

commit 3a7591ad5315308d24c0e444ce304ff78aef2304
Author: jerryshao <sshao@...>
Date:   2017-03-03T01:18:52Z

    [SPARK-19750][UI][BRANCH-2.1] Fix redirect issue from http to https
    
    ## What changes were proposed in this pull request?
    
    If the Spark UI port (4040) is not set, it will choose port number 0, which makes the https port also choose 0. In the Spark 2.1 code, this https port (0) is used for the redirect, so when the redirect is triggered it points to a wrong URL, like:
    
    ```
    /tmp/temp$ wget http://172.27.25.134:55015
    --2017-02-23 12:13:54--  http://172.27.25.134:55015/
    Connecting to 172.27.25.134:55015... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: https://172.27.25.134:0/ [following]
    --2017-02-23 12:13:54--  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:13:55--  (try: 2)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:13:57--  (try: 3)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    --2017-02-23 12:14:00--  (try: 4)  https://172.27.25.134:0/
    Connecting to 172.27.25.134:0... failed: Can't assign requested address.
    Retrying.
    
    ```
    
    So instead of using 0 for the redirect, we should pick the actually bound port.
    
    This issue only exists in Spark 2.1 and earlier, and can be reproduced in yarn cluster mode.
    
    ## How was this patch tested?
    
    The current redirect UT doesn't cover this issue, so the existing UT is extended to verify it.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17083 from jerryshao/SPARK-19750.

commit 1237aaea279d6aac504ae1e3265c0b53779b5303
Author: guifeng <guifengleaf@...>
Date:   2017-03-03T05:19:29Z

    [SPARK-19779][SS] Delete needless tmp file after restarting a structured streaming job
    
    ## What changes were proposed in this pull request?
    
    [SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)
    
    The PR (https://github.com/apache/spark/pull/17012) fixed restarting a Structured Streaming application that uses HDFS as the file system, but a problem remains: the tmp file for a delta file is still left behind on HDFS, and Structured Streaming never deletes the tmp file generated when the streaming job is restarted.
    
    ## How was this patch tested?
     unit tests
    
    Author: guifeng <guifengl...@gmail.com>
    
    Closes #17124 from gf53520/SPARK-19779.
    
    (cherry picked from commit e24f21b5f8365ed25346e986748b393e0b4be25c)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit accbed7c2cfbe46fa6f55e97241b617c6ad4431f
Author: Zhe Sun <ymwdalex@...>
Date:   2017-03-03T10:55:57Z

    [SPARK-19797][DOC] ML pipeline document correction
    
    ## What changes were proposed in this pull request?
    The description of pipelines in this paragraph is incorrect: https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
    
    > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
    
    Reason: a Transformer could also be a stage. But only another Estimator will invoke a transform call and pass the data to the next stage. The description in the document misleads ML pipeline users.
    
    ## How was this patch tested?
    This is a tiny modification of **docs/ml-pipelines.md**. I built the modified docs with jekyll and checked the compiled document.
    
    Author: Zhe Sun <ymwda...@gmail.com>
    
    Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
    
    (cherry picked from commit 0bac3e4cde75678beac02e67b8873fe779e9ad34)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit da04d45c2c3c98322220c57cee90be78cf2093d0
Author: Burak Yavuz <brkyvz@...>
Date:   2017-03-03T18:35:15Z

    [SPARK-19774] StreamExecution should call stop() on sources when a stream fails
    
    ## What changes were proposed in this pull request?
    
    We call stop() on a Structured Streaming Source only when the stream is shut down by a user calling streamingQuery.stop(). We should also stop all sources when the stream fails; otherwise we may leak resources, e.g. connections to Kafka.
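    
    A sketch of the idea (names are stand-ins for StreamExecution's internals, not the actual patch): stop every source on both the normal-shutdown and failure paths.
    
    ```scala
    import scala.util.control.NonFatal
    import org.apache.spark.sql.execution.streaming.Source
    
    def stopSources(sources: Seq[Source]): Unit = {
      sources.foreach { source =>
        try source.stop()
        catch { case NonFatal(e) => println(s"Failed to stop $source: $e") }
      }
    }
    
    def runWithCleanup(sources: Seq[Source])(runStream: => Unit): Unit = {
      try runStream
      finally stopSources(sources)  // runs on failure as well as on stop()
    }
    ```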
    
    ## How was this patch tested?
    
    Unit tests in `StreamingQuerySuite`.
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #17107 from brkyvz/close-source.
    
    (cherry picked from commit 9314c08377cc8da88f4e31d1a9d41376e96a81b3)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 664c9795c94d3536ff9fe54af06e0fb6c0012862
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-04T03:00:35Z

    [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level
    
    ## What changes were proposed in this pull request?
    
    "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
running after it won't output any logs except fatal logs.
    
    This PR uses `testQuietly` instead to avoid changing the log level.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17156 from zsxwing/SPARK-19816.
    
    (cherry picked from commit fbc4058037cf5b0be9f14a7dd28105f7f8151bed)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit ca7a7e8a893a30d85e4315a4fa1ca1b1c56a703c
Author: uncleGen <hustyugm@...>
Date:   2017-03-06T02:17:30Z

    [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not filter checkpointFilesOfLatestTime with the PATH string.
    
    ## What changes were proposed in this pull request?
    
    
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/
    
    ```
    sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code
    passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds.
    Last failure message: 8 did not equal 2.
        at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite.scala:172)
        at org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211)
    ```
    
    the check condition is:
    
    ```
    val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter {
         _.toString.contains(clock.getTimeMillis.toString)
    }
    // Checkpoint files are written twice for every batch interval. So assert that both
    // are written to make sure that both of them have been written.
    assert(checkpointFilesOfLatestTime.size === 2)
    ```
    
    the path string may contain the `clock.getTimeMillis.toString`, like `3500`:
    
    ```
    
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
    file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
                                                           ▲▲▲▲
    ```
    
    So we should only check the filename, not the whole path.
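    
    A sketch of the tightened filter (compare against the file name only, reusing the identifiers from the snippet above):
    
    ```scala
    val checkpointFilesOfLatestTime =
      Checkpoint.getCheckpointFiles(checkpointDir).filter { path =>
        path.getName.contains(clock.getTimeMillis.toString)
      }
    ```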
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <husty...@gmail.com>
    
    Closes #17167 from uncleGen/flaky-CheckpointSuite.
    
    (cherry picked from commit 207067ead6db6dc87b0d144a658e2564e3280a89)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit fd6c6d5c363008a229759bf628edc0f6c5e00ade
Author: Tyson Condie <tcondie@...>
Date:   2017-03-07T00:39:05Z

    [SPARK-19719][SS] Kafka writer for both structured streaming and batch queries
    
    ## What changes were proposed in this pull request?
    
    Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
    ### Streaming Kafka Sink
    - When addBatch is called
    -- If batchId is greater than the last written batch
    --- Write batch to Kafka
    ---- Topic will be taken from the record, if present, or from a topic option, which overrides the topic in the record.
    -- Else ignore
    
    ### Batch Kafka Sink
    - KafkaSourceProvider will implement CreatableRelationProvider
    - CreatableRelationProvider#createRelation will write the passed-in DataFrame to Kafka
    - Topic will be taken from the record, if present, or from the topic option, which overrides the topic in the record.
    - Save modes Append and ErrorIfExists are supported under identical semantics. Other save modes result in an AnalysisException.
    
    tdas zsxwing
    
    ## How was this patch tested?
    
    ### The following unit tests will be included
    - write to stream with topic field: valid stream write with data that 
includes an existing topic in the schema
    - write structured streaming aggregation w/o topic field, with default 
topic: valid stream write with data that does not include a topic field, but 
the configuration includes a default topic
    - write data with bad schema: various cases of writing data that does not 
conform to a proper schema e.g., 1. no topic field or default topic, and 2. no 
value field
    - write data with valid schema but wrong types: data with a complete schema 
but wrong types e.g., key and value types are integers.
    - write to non-existing topic: write a stream to a topic that does not 
exist in Kafka, which has been configured to not auto-create topics.
    - write batch to kafka: simple write batch to Kafka, which goes through the 
same code path as streaming scenario, so validity checks will not be redone 
here.
    
    ### Examples
    ```scala
    // Structured Streaming
    val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
     .selectExpr("value as key", "value as value")
     .writeStream
     .format("kafka")
     .option("checkpointLocation", checkpointDir)
     .outputMode(OutputMode.Append)
     .option("kafka.bootstrap.servers", brokerAddress)
     .option("topic", topic)
     .queryName("kafkaStream")
     .start()
    
    // Batch
    val df = spark
     .sparkContext
     .parallelize(Seq("1", "2", "3", "4", "5"))
     .map(v => (topic, v))
     .toDF("topic", "value")
    
    df.write
     .format("kafka")
     .option("kafka.bootstrap.servers",brokerAddress)
     .option("topic", topic)
     .save()
    ```
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Tyson Condie <tcon...@gmail.com>
    
    Closes #17043 from tcondie/kafka-writer.

commit 711addd46e98e42deca97c5b9c0e55fddebaa458
Author: Jason White <jason.white@...>
Date:   2017-03-07T21:14:37Z

    [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long
    
    ## What changes were proposed in this pull request?
    
    Cast the output of `TimestampType.toInternal` to long to allow for proper 
Timestamp creation in DataFrames near the epoch.
    
    ## How was this patch tested?
    
    Added a new test that fails without the change.
    
    dongjoon-hyun davies Mind taking a look?
    
    The contribution is my original work and I license the work to the project 
under the project’s open source license.
    
    Author: Jason White <jason.wh...@shopify.com>
    
    Closes #16896 from JasonMWhite/SPARK-19561.
    
    (cherry picked from commit 6f4684622a951806bebe7652a14f7d1ce03e24c7)
    Signed-off-by: Davies Liu <davies....@gmail.com>

commit 551b7bdbe00b9ee803baa18e6b4690c478af9161
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-03-08T00:21:18Z

    [SPARK-19857][YARN] Correctly calculate next credential update time.
    
    Add parentheses so that both lines form a single statement; also add
    a log message so that the issue becomes more explicit if it shows up
    again.
    
    Tested manually with integration test that exercises the feature.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #17198 from vanzin/SPARK-19857.
    
    (cherry picked from commit 8e41c2eed873e215b13215844ba5ba73a8906c5b)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit cbc37007aa07991135a3da13ad566be76a0ef577
Author: Wenchen Fan <wenchen@...>
Date:   2017-03-08T01:15:39Z

    Revert "[SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long"
    
    This reverts commit 6f4684622a951806bebe7652a14f7d1ce03e24c7.

commit 3b648a62626850470f8cceea3f0ec5dfd46e4e33
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-08T04:34:55Z

    [SPARK-19859][SS] The new watermark should override the old one
    
    ## What changes were proposed in this pull request?
    
    The new watermark should override the old one. Otherwise, we just pick up the first column that has a watermark, which may be unexpected.
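    
    For illustration (the DataFrame and column names are made up): with this change, the most recently set watermark wins.
    
    ```scala
    val withNewWatermark = eventsDf
      .withWatermark("firstEventTime", "10 minutes")  // old watermark
      .withWatermark("secondEventTime", "1 hour")     // new watermark overrides it
    ```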
    
    ## How was this patch tested?
    
    The new test.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17199 from zsxwing/SPARK-19859.
    
    (cherry picked from commit d8830c5039d9c7c5ef03631904c32873ab558e22)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 0ba9ecbea88533b2562f2f6045eafeab99d8f0c6
Author: Bryan Cutler <cutlerb@...>
Date:   2017-03-08T04:44:30Z

    [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
    
    ## What changes were proposed in this pull request?
    The `keyword_only` decorator in PySpark is not thread-safe.  It writes 
kwargs to a static class variable in the decorator, which is then retrieved 
later in the class method as `_input_kwargs`.  If multiple threads are 
constructing the same class with different kwargs, it becomes a race condition 
to read from the static class variable before it's overwritten.  See 
[SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for 
reproduction code.
    
    This change will write the kwargs to a member variable so that multiple 
threads can operate on separate instances without the race condition.  It does 
not protect against multiple threads operating on a single instance, but that 
is better left to the user to synchronize.
    
    ## How was this patch tested?
    Added new unit tests for using the keyword_only decorator and a regression 
test that verifies `_input_kwargs` can be overwritten from different class 
instances.
    
    Author: Bryan Cutler <cutl...@gmail.com>
    
    Closes #17193 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.

commit 320eff14b0bb634eba2cdcae2303ba38fd0eb282
Author: Michael Armbrust <michael@...>
Date:   2017-03-08T09:32:42Z

    [SPARK-18055][SQL] Use correct mirror in ExpressionEncoder
    
    Previously, we were using the mirror of passed in `TypeTag` when reflecting 
to build an encoder.  This fails when the outer class is built in (i.e. `Seq`'s 
default mirror is based on root classloader) but inner classes (i.e. `A` in 
`Seq[A]`) are defined in the REPL or a library.
    
    This patch changes us to always reflect based on a mirror created using the 
context classloader.
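    
    A minimal sketch of the mirror change (illustrative, not the exact patch):
    
    ```scala
    import scala.reflect.runtime.universe
    
    // Reflect with a mirror built from the context classloader rather than the
    // mirror carried by the incoming TypeTag, so REPL-defined classes resolve.
    val mirror = universe.runtimeMirror(Thread.currentThread().getContextClassLoader)
    ```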
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #17201 from marmbrus/replSeqEncoder.
    
    (cherry picked from commit 314e48a3584bad4b486b046bbf0159d64ba857bc)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit f6c1ad2eb6d0706899aabbdd39e558b3488e2ef3
Author: Burak Yavuz <brkyvz@...>
Date:   2017-03-08T22:35:07Z

    [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
    
    ## What changes were proposed in this pull request?
    
    **The Problem**
    There is a file stream source option called maxFileAge which limits how old the files can be, relative to the latest file that has been seen. This is used to limit the files that need to be remembered as "processed". Files older than the latest processed files are ignored. This value is by default 7 days.
    This causes a problem when both
    latestFirst = true
    maxFilesPerTrigger > total files to be processed.
    Here is what happens in all combinations:
    1) latestFirst = false - Since files are processed in order, there won't be any unprocessed file older than the latest processed file. All files will be processed.
    2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is not set, then all old files get processed in the first batch, and so no file is left behind.
    3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch processes the latest X files. That sets the threshold to latest file - maxFileAge, so files older than this threshold will never be considered for processing.
    The bug is with case 3.
    
    **The Solution**
    
    Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are set.
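    
    A reader configuration that hits case 3 above (the path and the `spark` session are made up); with the fix, `maxFileAge` is ignored for this combination:
    
    ```scala
    val stream = spark.readStream
      .option("latestFirst", "true")
      .option("maxFilesPerTrigger", "10")
      .text("/data/incoming")
    ```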
    
    ## How was this patch tested?
    
    Regression test in `FileStreamSourceSuite`
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #17153 from brkyvz/maxFileAge.
    
    (cherry picked from commit a3648b5d4f99ff9461d02f53e9ec71787a3abf51)
    Signed-off-by: Burak Yavuz <brk...@gmail.com>

commit 3457c32297e0150a4fbc80a30f84b9c62ca7c372
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-08T22:30:54Z

    Revert "[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful 
operations for branch-2.1"
    
    This reverts commit 502c927b8c8a99ef2adf4e6e1d7a6d9232d45ef5.

----

