GitHub user xiaobing007 opened a pull request:
https://github.com/apache/spark/pull/18782
Branch 2.1
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18782.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18782
----
commit b12a76a411cf49baf53e265a194ba41adfb8d9f4
Author: Takeshi YAMAMURO <[email protected]>
Date: 2017-01-26T17:50:42Z
[SPARK-19338][SQL] Add UDF names in explain
## What changes were proposed in this pull request?
This PR adds a variable for the UDF name in `ScalaUDF`.
Then, if the variable is filled, `DataFrame#explain` prints the name.
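As an illustration (not from the PR itself), a minimal sketch of how a registered, named UDF would surface in the plan output, assuming a local SparkSession:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.callUDF

val spark = SparkSession.builder().master("local[*]").appName("udf-explain").getOrCreate()
import spark.implicits._

// Register a UDF under an explicit name; with this change the name is carried
// into ScalaUDF, so the physical plan can print it.
spark.udf.register("plusOne", (x: Long) => x + 1)

// The printed plan should now mention "plusOne" instead of an anonymous UDF marker
// (exact output format may differ).
spark.range(3).select(callUDF("plusOne", $"id")).explain()
```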
## How was this patch tested?
Added a test in `UDFSuite`.
Author: Takeshi YAMAMURO <[email protected]>
Closes #16707 from maropu/SPARK-19338.
(cherry picked from commit 9f523d3192c71a728fd8a2a64f52bbc337f2f026)
Signed-off-by: gatorsmile <[email protected]>
commit 59502bbcf6e64e5b5e3dda080441054afaf58c53
Author: Marcelo Vanzin <[email protected]>
Date: 2017-01-27T00:53:28Z
[SPARK-19220][UI] Make redirection to HTTPS apply to all URIs. (branch-2.1)
The redirect handler was installed only for the root of the server;
any other context ended up being served directly through the HTTP
port. Since every sub page (e.g. application UIs in the history
server) is a separate servlet context, this meant that everything
but the root was still accessible via HTTP.
The change adds separate names to each connector, and binds contexts
to specific connectors so that content is only served through the
HTTPS connector when it's enabled. In that case, the only thing that
binds to the HTTP connector is the redirect handler.
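For context, a hedged, simplified sketch (not the actual Spark code) of the Jetty mechanism this relies on: naming each connector and binding a context to a connector via the `@connectorName` virtual-host convention:
```scala
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.server.handler.{ContextHandler, ContextHandlerCollection}

val server = new Server()

// Named HTTP connector: only the redirect handler would be bound here.
val httpConnector = new ServerConnector(server)
httpConnector.setName("http")
httpConnector.setPort(8080)
server.addConnector(httpConnector)

// Named HTTPS connector (SSL connection factories omitted for brevity).
val httpsConnector = new ServerConnector(server)
httpsConnector.setName("https")
httpsConnector.setPort(8443)
server.addConnector(httpsConnector)

// Bind a context to the HTTPS connector only, via the "@connectorName" virtual host.
val appContext = new ContextHandler("/app")
appContext.setVirtualHosts(Array("@https"))

val contexts = new ContextHandlerCollection()
contexts.addHandler(appContext)
server.setHandler(contexts)
```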
Tested with new unit tests and by checking a live history server.
(cherry picked from commit d3dcb63b9709a34337327be9b7d3705698716077)
Author: Marcelo Vanzin <[email protected]>
Closes #16711 from vanzin/SPARK-19220_2.1.
commit ba2a5ada4825a9ca3e4e954a51574a2eede096a3
Author: Felix Cheung <[email protected]>
Date: 2017-01-27T05:06:39Z
[SPARK-18788][SPARKR] Add API for getNumPartitions
## What changes were proposed in this pull request?
With documentation noting that this converts the DataFrame into an RDD.
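For reference, a hedged Scala sketch of the JVM-side operation the new SparkR API wraps (the DataFrame-to-RDD conversion mentioned above):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("num-partitions").getOrCreate()

// getNumPartitions in SparkR goes through the DataFrame's underlying RDD,
// which is what the JVM-side expression below does directly.
val df = spark.range(0L, 100L, 1L, numPartitions = 4).toDF("id")
println(df.rdd.getNumPartitions) // 4
```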
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <[email protected]>
Closes #16668 from felixcheung/rgetnumpartitions.
(cherry picked from commit 90817a6cd06068fa9f9ff77384a1fcba73b43006)
Signed-off-by: Felix Cheung <[email protected]>
commit 4002ee97dfd67a6305d062705df8f539cdbc8ac8
Author: Felix Cheung <[email protected]>
Date: 2017-01-27T18:31:28Z
[SPARK-19333][SPARKR] Add Apache License headers to R files
## What changes were proposed in this pull request?
add header
## How was this patch tested?
Manual run to check vignettes html is created properly
Author: Felix Cheung <[email protected]>
Closes #16709 from felixcheung/rfilelicense.
(cherry picked from commit 385d73848b0d274467b633c7615e03b370f4a634)
Signed-off-by: Felix Cheung <[email protected]>
commit 9a49f9afa7fcf2f968914ac81d13e27db3451491
Author: Felix Cheung <[email protected]>
Date: 2017-01-27T20:41:35Z
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR
## What changes were proposed in this pull request?
This mostly affects running a job from the driver in client mode when results
are expected to come through stdout (which should be somewhat rare, but possible)
Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```
After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```
Now, `column.explain()` doesn't seem very useful (we can get more extensive
output with `DataFrame.explain()`), but there are other, more complex examples
with calls to `println` on the Scala/JVM side whose output is getting dropped.
## How was this patch tested?
manual
Author: Felix Cheung <[email protected]>
Closes #16670 from felixcheung/rjvmstdout.
(cherry picked from commit a7ab6f9a8fdfb927f0bcefdc87a92cc82fac4223)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 445438c9f485489f22a1c4b9ec2644a7a9426d9b
Author: gatorsmile <[email protected]>
Date: 2017-01-30T22:05:53Z
[SPARK-19396][DOC] JDBC Options are Case Insensitive
### What changes were proposed in this pull request?
JDBC options are not case sensitive, after PR
https://github.com/apache/spark/pull/15884 was merged into Spark 2.1.
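For illustration, a hedged example of the documented behaviour, assuming an active SparkSession `spark`; the URL and table below are placeholders:
```scala
// JDBC option keys are matched case-insensitively, so "fetchSize" and "fetchsize"
// are treated the same. URL and table below are placeholders.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
  .option("dbtable", "public.my_table")                 // placeholder table
  .option("fetchSize", "1000")                          // equivalent to "fetchsize"
  .load()
```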
### How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #16734 from gatorsmile/fixDocCaseInsensitive.
(cherry picked from commit c0eda7e87fe06c5ec8d146829e25f3627f18c529)
Signed-off-by: gatorsmile <[email protected]>
commit 07a1788ee04597700f101183e67d237f9a866c46
Author: gatorsmile <[email protected]>
Date: 2017-01-31T02:38:14Z
[SPARK-19406][SQL] Fix function to_json to respect user-provided options
### What changes were proposed in this pull request?
Currently, the function `to_json` allows users to provide options for
generating JSON. However, it does not pass it to `JacksonGenerator`. Thus, it
ignores the user-provided options. This PR is to fix it. Below is an example.
```Scala
val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
df.select(to_json($"a", options)).show(false)
```
The current output is like
```
+--------------------------------------+
|structtojson(a) |
+--------------------------------------+
|{"_1":"2015-08-26T18:00:00.000-07:00"}|
+--------------------------------------+
```
After the fix, the output is like
```
+-------------------------+
|structtojson(a) |
+-------------------------+
|{"_1":"26/08/2015 18:00"}|
+-------------------------+
```
### How was this patch tested?
Added test cases for both `from_json` and `to_json`
Author: gatorsmile <[email protected]>
Closes #16745 from gatorsmile/toJson.
(cherry picked from commit f9156d2956a8e751720bf63071c504a3e86f267d)
Signed-off-by: gatorsmile <[email protected]>
commit e43f161bbe04d2dc2af1e2f9280d4d0b47392acf
Author: Felix Cheung <[email protected]>
Date: 2017-01-31T06:14:58Z
[BACKPORT-2.1][SPARKR][DOCS] update R API doc for subset/extract
## What changes were proposed in this pull request?
backport #16721 to branch-2.1
## How was this patch tested?
manual
Author: Felix Cheung <[email protected]>
Closes #16749 from felixcheung/rsubsetdocbackport.
commit d35a1268d784a268e6137eff54eb8f83c981a289
Author: Burak Yavuz <[email protected]>
Date: 2017-02-01T00:52:53Z
[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics
even if there is no new data in trigger
In Structured Streaming, if a new trigger was skipped because no new data
arrived, we suddenly report nothing for the `stateOperator` metrics. We could,
however, easily report the metrics from `lastExecution` to ensure continuity of
metrics.
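For illustration, a hedged sketch of where these metrics surface; `query` is assumed to be a running StreamingQuery:
```scala
// Every StreamingQueryProgress carries stateOperators and eventTime entries; the fix
// keeps them populated even for triggers that processed no new data.
val progress = query.lastProgress
progress.stateOperators.foreach { op =>
  println(s"state rows: total=${op.numRowsTotal}, updated=${op.numRowsUpdated}")
}
println(progress.eventTime) // watermark / min / max event-time entries, when present
```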
Regression test in `StreamingQueryStatusAndProgressSuite`
Author: Burak Yavuz <[email protected]>
Closes #16716 from brkyvz/state-agg.
(cherry picked from commit 081b7addaf9560563af0ce25912972e91a78cee6)
Signed-off-by: Tathagata Das <[email protected]>
commit 61cdc8c7cc8cfc57646a30da0e0df874a14e3269
Author: Zheng RuiFeng <[email protected]>
Date: 2017-02-01T13:27:20Z
[SPARK-19410][DOC] Fix broken links in ml-pipeline and ml-tuning
## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning
`<div data-lang="scala">` -> `<div data-lang="scala" markdown="1">`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <[email protected]>
Closes #16754 from zhengruifeng/doc_api_fix.
(cherry picked from commit 04ee8cf633e17b6bf95225a8dd77bf2e06980eb3)
Signed-off-by: Sean Owen <[email protected]>
commit f946464155bb907482dc8d8a1b0964a925d04081
Author: Devaraj K <[email protected]>
Date: 2017-02-01T20:55:11Z
[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED
## What changes were proposed in this pull request?
The killed status was not copied when building the newTaskInfo object (which
drops unnecessary details to reduce memory usage). This patch copies the killed
status into the newTaskInfo object, which corrects the status shown in the
Web UI from the wrong status to KILLED.
## How was this patch tested?
Current behaviour of displaying tasks in the stage UI page:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
Author: Devaraj K <[email protected]>
Closes #16725 from devaraj-kavali/SPARK-19377.
(cherry picked from commit df4a27cc5cae8e251ba2a883bcc5f5ce9282f649)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 7c23bd49e826fc2b7f132ffac2e55a71905abe96
Author: Shixiong Zhu <[email protected]>
Date: 2017-02-02T05:39:21Z
[SPARK-19432][CORE] Fix an unexpected failure when a connection times out
## What changes were proposed in this pull request?
When a connection times out, `ask` may fail with a confusing message:
```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
  at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
  at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
  at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
  at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```
It's better to provide a meaningful message.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16773 from zsxwing/connect-timeout.
(cherry picked from commit 8303e20c45153f91e585e230caa29b728a4d8c6c)
Signed-off-by: Shixiong Zhu <[email protected]>
commit f55bd4c736b01f1fe3df0ca2f4582c8d2b4d77f9
Author: Herman van Hovell <[email protected]>
Date: 2017-02-06T20:28:13Z
[SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function
call
## What changes were proposed in this pull request?
The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function
call. This happens in cases like the following:
```sql
select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end
from tb
```
This PR fixes this by re-organizing the case related parsing rules.
## How was this patch tested?
Added a regression test to the `ExpressionParserSuite`.
Author: Herman van Hovell <[email protected]>
Closes #16821 from hvanhovell/SPARK-19472.
(cherry picked from commit cb2677b86039a75fcd8a4e567ab06055f054a19a)
Signed-off-by: gatorsmile <[email protected]>
commit 62fab5beee147c90d8b7f8092b4ee76ba611ee8e
Author: uncleGen <[email protected]>
Date: 2017-02-07T05:03:20Z
[SPARK-19407][SS] defaultFS from FileSystem.get is used instead of getting it
from the URI scheme
## What changes were proposed in this pull request?
```
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3a://**************/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
  at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
  at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
  at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
  at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
```
This can easily be replicated on a Spark standalone cluster by providing a
checkpoint location with a URI scheme other than "file://" and not overriding it
in the config. Workaround: pass `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`,
or set it in SparkConf or spark-defaults.conf.
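For context, a hedged sketch of the general Hadoop pattern behind the fix (not the exact Spark change): resolving the FileSystem from the path's own scheme rather than from `fs.defaultFS`:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
// Placeholder checkpoint path; running against s3a requires the s3a connector on the classpath.
val checkpointPath = new Path("s3a://somebucket/checkpoint/metadata")

// FileSystem.get returns whatever fs.defaultFS points to (often file:///),
// which is what produced the "Wrong FS" error above.
val defaultFs = FileSystem.get(hadoopConf)

// Resolving from the path's own URI scheme yields the matching FileSystem instead.
val schemeFs = checkpointPath.getFileSystem(hadoopConf)

println(defaultFs.getUri)
println(schemeFs.getUri)
```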
## How was this patch tested?
Existing unit tests.
Author: uncleGen <[email protected]>
Closes #16815 from uncleGen/SPARK-19407.
(cherry picked from commit 7a0a630e0f699017c7d0214923cd4aa0227e62ff)
Signed-off-by: Shixiong Zhu <[email protected]>
commit dd1abef138581f30ab7a8dfacb616fe7dd64b421
Author: Aseem Bansal <[email protected]>
Date: 2017-02-07T11:44:14Z
[SPARK-19444][ML][DOCUMENTATION] Fix imports not being present in
documentation
## What changes were proposed in this pull request?
SPARK-19444 imports not being present in documentation
## How was this patch tested?
Manual
## Disclaimer
Contribution is original work and I license the work to the project under
the project's open source license
Author: Aseem Bansal <[email protected]>
Closes #16789 from anshbansal/patch-1.
(cherry picked from commit aee2bd2c7ee97a58f0adec82ec52e5625b39e804)
Signed-off-by: Sean Owen <[email protected]>
commit e642a07d57798f98b25ba08ed7ae3abe0f597941
Author: Tyson Condie <[email protected]>
Date: 2017-02-07T22:31:23Z
[SPARK-18682][SS] Batch Source for Kafka
Today, you can start a stream that reads from Kafka. However, given Kafka's
configurable retention period, it seems like sometimes you might just want to
read all of the data that is available now. As such we should add a version
that works with spark.read as well.
The options should be the same as the streaming Kafka source, with the
following differences:
- startingOffsets should default to earliest, and should not allow latest
(which would always be empty).
- endingOffsets should also be allowed and should default to latest. The same
assign JSON format as startingOffsets should also be accepted.
It would be really good if things like .limit(n) were enough to prevent
all the data from being read (this might just work).
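For illustration, a hedged sketch of the batch read path this adds, using the options described above and assuming an active SparkSession `spark` (broker and topic are placeholders):
```scala
// Batch read over a bounded offset range; broker address and topic are placeholders.
val kafkaBatchDF = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092") // placeholder
  .option("subscribe", "events")                      // placeholder topic
  .option("startingOffsets", "earliest")              // batch default; latest is not allowed
  .option("endingOffsets", "latest")                  // default upper bound
  .load()

kafkaBatchDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
```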
KafkaRelationSuite was added for testing batch queries via KafkaUtils.
Author: Tyson Condie <[email protected]>
Closes #16686 from tcondie/SPARK-18682.
(cherry picked from commit 8df444403489aec0d68f7d930afdc4f7d50e0b41)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 706d6c154d2471c00253bf9b0c4e867752f841fe
Author: CodingCat <[email protected]>
Date: 2017-02-08T04:25:18Z
[SPARK-19499][SS] Add more notes in the comments of Sink.addBatch()
## What changes were proposed in this pull request?
The addBatch method in the Sink trait is supposed to be a synchronous method,
to coordinate with the fault-tolerance design in StreamingExecution (unlike the
compute() method in DStream).
We need to add more notes in the comments of this method to remind
developers.
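For context, a hedged sketch of the shape of the `Sink` trait and the kind of note being added; the wording below is illustrative, not the actual comment text:
```scala
import org.apache.spark.sql.DataFrame

trait Sink {
  /**
   * Adds a batch of data to this sink.
   *
   * Note: this is expected to complete synchronously. StreamExecution relies on the
   * batch being fully persisted before it advances the offset log, so deferring the
   * work (the way DStream's compute() can) would break the fault-tolerance guarantees.
   */
  def addBatch(batchId: Long, data: DataFrame): Unit
}
```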
## How was this patch tested?
existing tests
Author: CodingCat <[email protected]>
Closes #16840 from CodingCat/SPARK-19499.
(cherry picked from commit d4cd975718716be11a42ce92a47c45be1a46bd60)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 4d040297f55243703463ea71d5302bb46ea0bf3f
Author: manugarri <[email protected]>
Date: 2017-02-08T05:45:33Z
[MINOR][DOC] Remove parenthesis in readStream() on kafka structured
streaming doc
There is a typo in
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream:
the first Python example uses `readStream()` instead of `readStream`.
Just removed the parentheses.
Author: manugarri <[email protected]>
Closes #16836 from manugarri/fix_kafka_python_doc.
(cherry picked from commit 5a0569ce693c635c5fa12b2de33ed3643ce888e3)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 71b6eacf72fb50862d33a2bf6a0662d6c4e73bbd
Author: Herman van Hovell <[email protected]>
Date: 2017-02-08T07:35:15Z
[SPARK-18609][SPARK-18841][SQL][BACKPORT-2.1] Fix redundant Alias removal
in the optimizer
This is a backport of
https://github.com/apache/spark/commit/73ee73945e369a862480ef4ac64e55c797bd7d90
## What changes were proposed in this pull request?
The optimizer tries to remove redundant alias-only projections from the
query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies
and removes such a project and rewrites the project's attributes in the **entire**
tree. This causes problems when parts of the tree are duplicated (for instance
a self join on a temporary view/CTE) and the duplicated part contains the
alias-only project; in this case the rewrite will break the tree.
This PR fixes these problems by using a blacklist for attributes that are
not to be moved, and by making sure that attribute remapping is only done for
the parent tree, and not for unrelated parts of the query plan.
The current tree transformation infrastructure works very well if the
transformation at hand requires little or no global contextual information. In
this case we need to know both the attributes that are not to be moved and
which child attributes were modified. This cannot be done easily using the
current infrastructure, and solutions typically involve traversing the query
plan multiple times (which is super slow). I have moved around some code in
`TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more
straightforward; this basically allows you to manually traverse the tree.
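For illustration, a hedged example (not from the PR) of the query shape described above: a temporary view containing an alias-only projection, self-joined so its subtree is duplicated:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
import spark.implicits._

// The view's plan contains an alias-only projection (id renamed to x).
spark.range(5).select($"id".as("x")).createOrReplaceTempView("v")

// Self-joining the view duplicates that subtree, the situation in which the old
// rewrite could break the plan.
spark.sql("SELECT a.x FROM v a JOIN v b ON a.x = b.x").explain(true)
```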
## How was this patch tested?
I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have
added integration tests to the `SQLQueryTestSuite.union` and
`SQLQueryTestSuite.cte` test cases.
Author: Herman van Hovell <[email protected]>
Closes #16843 from hvanhovell/SPARK-18609-2.1.
commit 502c927b8c8a99ef2adf4e6e1d7a6d9232d45ef5
Author: Tathagata Das <[email protected]>
Date: 2017-02-08T19:33:59Z
[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for
branch-2.1
This is a follow up PR for merging #16758 to spark 2.1 branch
## What changes were proposed in this pull request?
`mapGroupsWithState` is a new API for arbitrary stateful operations in
Structured Streaming, similar to `DStream.mapWithState`
*Requirements*
- Users should be able to specify a function that can do the following
- Access the input row corresponding to a key
- Access the previous state corresponding to a key
- Optionally, update or remove the state
- Output any number of new rows (or none at all)
*Proposed API*
```
// ------------ New methods on KeyValueGroupedDataset ------------
class KeyValueGroupedDataset[K, V] {
  // Scala friendly
  def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
  def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])

  // Java friendly
  def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
  def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
}

// ------------------- New Java-friendly function classes -------------------
public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}

// ---------------------- Wrapper class for state data ----------------------
trait KeyedState[S] {
  def exists(): Boolean
  def get(): S                   // throws an exception if the state does not exist
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit             // exists() will be false after this
}
```
Key semantics of the State class
- The state can be null.
- If state.remove() is called, then state.exists() will return false, and getOption will return None.
- After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
- None of the operations are thread-safe. This is to avoid memory barriers.
*Usage*
```
val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
  val newCount = words.size + runningCount.getOption.getOrElse(0L)
  runningCount.update(newCount)
  (word, newCount)
}

dataset                                                 // type is Dataset[String]
  .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
  .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
```
## How was this patch tested?
New unit tests.
Author: Tathagata Das <[email protected]>
Closes #16850 from tdas/mapWithState-branch-2.1.
commit b3fd36a15a0924b9de88dadc6e0acbe504ba4b96
Author: Shixiong Zhu <[email protected]>
Date: 2017-02-09T19:16:51Z
[SPARK-19481] [REPL] [MAVEN] Avoid leaking SparkContext in
Signaling.cancelOnInterrupt
## What changes were proposed in this pull request?
`Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes
ReplSuite unstable.
This PR adds `SparkContext.getActive` to allow
`Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the
leak.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16825 from zsxwing/SPARK-19481.
(cherry picked from commit 303f00a4bf6660dd83c8bd9e3a107bb3438a421b)
Signed-off-by: Davies Liu <[email protected]>
commit a3d5300a030fb5f1c275e671603e0745b6466735
Author: Stan Zhai <[email protected]>
Date: 2017-02-09T20:01:25Z
[SPARK-19509][SQL] Grouping Sets do not respect nullable grouping columns
## What changes were proposed in this pull request?
The analyzer currently does not check if a column used in grouping sets is
actually nullable itself. This can cause the nullability of the column to be
incorrect, which can cause null pointer exceptions down the line. This PR fixes
that by also considering the nullability of the column.
This is only a problem for Spark 2.1 and below. The latest master uses a
different approach.
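For illustration, a hedged example (not the PR's test) of the affected query shape; `spark` is assumed to be an active SparkSession:
```scala
import spark.implicits._

Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")
  .createOrReplaceTempView("t")

// k2 is non-nullable in the input, but is null for the (k1)-only grouping set,
// so the analyzed schema must mark it nullable.
val grouped = spark.sql(
  "SELECT k1, k2, sum(v) AS total FROM t GROUP BY k1, k2 GROUPING SETS ((k1), (k1, k2))")

grouped.printSchema()
grouped.show()
```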
Closes https://github.com/apache/spark/pull/16874
## How was this patch tested?
Added a regression test to `SQLQueryTestSuite.grouping_set`.
Author: Herman van Hovell <[email protected]>
Closes #16873 from hvanhovell/SPARK-19509.
commit ff5818b8cee7c718ef5bdef125c8d6971d64acde
Author: Bogdan Raducanu <[email protected]>
Date: 2017-02-10T09:50:07Z
[SPARK-19512][BACKPORT-2.1][SQL] codegen for comparing structs fails #16852
## What changes were proposed in this pull request?
Set currentVars to null in GenerateOrdering.genComparisons before genCode
is called. genCode ignores INPUT_ROW if currentVars is not null and in
genComparisons we want it to use INPUT_ROW.
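For illustration, a hedged example (not the PR's test) of a query that exercises code-generated ordering over struct values; `spark` is assumed to be an active SparkSession:
```scala
import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = Seq((1, "b"), (1, "a"), (2, "a")).toDF("x", "y")
  .select(struct($"x", $"y").as("s"))

// Sorting by a struct column goes through GenerateOrdering.genComparisons,
// the code path fixed here.
df.sort($"s").show()
```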
## How was this patch tested?
Added test with 2 queries in WholeStageCodegenSuite
Author: Bogdan Raducanu <[email protected]>
Closes #16875 from bogdanrdc/SPARK-19512-2.1.
commit 7b5ea000e246f7052e7324fd7f2e99f32aaece17
Author: Burak Yavuz <[email protected]>
Date: 2017-02-10T11:55:06Z
[SPARK-19543] from_json fails when the input row is empty
## What changes were proposed in this pull request?
Using from_json on a column with an empty string results in
`java.util.NoSuchElementException: head of empty list`.
This is because `parser.parse(input)` may return `Nil` when
`input.trim.isEmpty`.
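For illustration, a hedged repro sketch of the failing shape described above; `spark` is assumed to be an active SparkSession:
```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}
import spark.implicits._

val schema = new StructType().add("a", StringType)
val df = Seq("""{"a":"1"}""", "").toDF("json")

// Before the fix, the empty-string row could raise
// java.util.NoSuchElementException: head of empty list.
df.select(from_json($"json", schema)).show(false)
```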
## How was this patch tested?
Regression test in `JsonExpressionsSuite`
Author: Burak Yavuz <[email protected]>
Closes #16881 from brkyvz/json-fix.
(cherry picked from commit d5593f7f5794bd0343e783ac4957864fed9d1b38)
Signed-off-by: Herman van Hovell <[email protected]>
commit e580bb035236dd92ade126af6bb98288d88179c4
Author: Andrew Ray <[email protected]>
Date: 2016-12-13T07:49:22Z
[SPARK-18717][SQL] Make code generation for Scala Map work with
immutable.Map also
## What changes were proposed in this pull request?
Fixes compile errors in generated code when a user has a case class with a
`scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since
ArrayBasedMapData.toScalaMap returns the immutable version, we can make it work
with both.
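For illustration, a hedged example of the case-class shape described above; `spark` is assumed to be an active SparkSession:
```scala
import spark.implicits._

// A field explicitly typed as scala.collection.immutable.Map, the shape that
// previously produced compile errors in the generated encoder code.
case class Record(id: Int, attrs: scala.collection.immutable.Map[String, Int])

val ds = Seq(Record(1, Map("a" -> 1)), Record(2, Map("b" -> 2))).toDS()
ds.show()
```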
## How was this patch tested?
Additional unit tests.
Author: Andrew Ray <[email protected]>
Closes #16161 from aray/fix-map-codegen.
(cherry picked from commit 46d30ac4846b3ec94426cc482c42cff72ebd6d92)
Signed-off-by: Cheng Lian <[email protected]>
commit 173c2387a38b260b46d7646b332e404f6ebe1a17
Author: titicaca <[email protected]>
Date: 2017-02-12T18:42:15Z
[SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp
column
## What changes were proposed in this pull request?
Fix a bug in the collect method for collecting a timestamp column; the bug can be
reproduced with the following code and output:
```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA,
as.POSIXct("2017-01-01 12:00:01")))
sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))
sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```
As we can see from the printed output, the column type of col2 in df2 is
unexpectedly converted to numeric when NA exists at the top of the column.
This is caused by `do.call(c, list)`: if we concatenate a list, e.g.
`do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the
result is numeric instead of POSIXct.
Therefore, we need to cast the data type of the vector explicitly.
## How was this patch tested?
The patch can be tested manually with the same code above.
Author: titicaca <[email protected]>
Closes #16689 from titicaca/sparkr-dev.
(cherry picked from commit bc0a0e6392c4e729d8f0e4caffc0bd05adb0d950)
Signed-off-by: Felix Cheung <[email protected]>
commit 06e77e0097c6fa0accc5d9d6ce08a65a3828b878
Author: [email protected] <[email protected]>
Date: 2017-02-12T18:48:55Z
[SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error
when the cluster size doesn't equal to k
## What changes were proposed in this pull request?
Backport fix of #16666
## How was this patch tested?
Backport unit tests
Author: [email protected] <[email protected]>
Closes #16761 from wangmiao1981/kmeansport.
commit fe4fcc5701cbd3f2e698e00f1cc7d49d5c7c702b
Author: Liwei Lin <[email protected]>
Date: 2017-02-13T07:00:22Z
[SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should
not be in the same group
## What changes were proposed in this pull request?
In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer
and create a new consumer. In our current implementation, the first consumer
and the second consumer would be in the same group (which leads to
SPARK-19559), **_violating our intention of the two consumers not being in the
same group._**
The cause is that, in our current implementation, the first consumer is
created before `groupId` and `nextId` are initialized in the constructor. Then
even if `groupId` and `nextId` are increased during the creation of that first
consumer, `groupId` and `nextId` would still be initialized to default values
in the constructor for the second consumer.
We should make sure that `groupId` and `nextId` are initialized before any
consumer is created.
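For context, a hedged, Spark-free Scala sketch of the initialization-order pitfall described above: a `val` referenced before it is initialized still holds its default value, so the "first consumer" ends up with a null group id:
```scala
class Reader {
  // Evaluated first: makeGroup() runs while groupId is still null.
  private val firstConsumerGroup = makeGroup()   // "group-null"

  private val groupId: String = "spark-kafka-reader-1"

  // Evaluated after groupId is initialized.
  private val secondConsumerGroup = makeGroup()  // "group-spark-kafka-reader-1"

  private def makeGroup(): String = s"group-$groupId"

  override def toString: String = s"first=$firstConsumerGroup, second=$secondConsumerGroup"
}

object InitOrderDemo extends App {
  println(new Reader()) // first=group-null, second=group-spark-kafka-reader-1
}
```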
## How was this patch tested?
Ran 100 times of `KafkaSourceSuite`; all passed
Author: Liwei Lin <[email protected]>
Closes #16902 from lw-lin/SPARK-19564-.
(cherry picked from commit 2bdbc87052389ff69404347fbc69457132dbcafd)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a3b6751375cf301dec156b85fe79e32b0797a24f
Author: Xiao Li <[email protected]>
Date: 2017-02-13T11:18:31Z
[SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount
is not equal to end indices amount
### What changes were proposed in this pull request?
```
Liquid Exception: Start indices amount is not equal to end indices amount,
see
/Users/xiao/IdeaProjects/sparkDelivery/docs/../examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java.
in ml-features.md
```
So far, the build is broken after merging
https://github.com/apache/spark/pull/16789
This PR is to fix it.
## How was this patch tested?
Manual
Author: Xiao Li <[email protected]>
Closes #16908 from gatorsmile/docMLFix.
(cherry picked from commit 855a1b7551c71b26ce7d9310342fefe0a87281ec)
Signed-off-by: Sean Owen <[email protected]>
commit ef4fb7ebca963eb95d6a8bf7543e05aa375edc23
Author: zero323 <[email protected]>
Date: 2017-02-13T17:26:49Z
[SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util
## What changes were proposed in this pull request?
Add missing `warnings` import.
## How was this patch tested?
Manual tests.
Author: zero323 <[email protected]>
Closes #16846 from zero323/SPARK-19506.
(cherry picked from commit 5e7cd3322b04f1dd207829b70546bc7ffdd63363)
Signed-off-by: Holden Karau <[email protected]>
----