GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/22042

    [SPARK-25005][SS]Support non-consecutive offsets for Kafka

    ## What changes were proposed in this pull request?
    
    When users write data with Kafka transactions, the offsets in Kafka 
will be non-consecutive: the log will contain transaction (commit or abort) 
markers. In addition, if the consumer's `isolation.level` is `read_committed`, 
`poll` will not return aborted messages either. Hence, we will see 
non-consecutive offsets in the data returned by `poll`. However, as `seekToEnd` 
may move the offset to one of these missing positions, there are 4 possible 
corner cases we need to support:
    
    - The whole batch contains no data messages
    - The first offset in a batch is not a committed data message
    - The last offset in a batch is not a committed data message
    - There is a gap in the middle of a batch
    
    They are all covered by the new unit tests.
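    The gap-skipping behavior described above can be illustrated with a
minimal Scala sketch. This is not Spark's actual implementation; the object
and method names are hypothetical, and the sequence stands in for the offsets
of committed data messages that `poll` actually returned:

```scala
object NonConsecutiveOffsets {
  /** Smallest committed data offset >= `wanted`, or None if the rest of the
    * batch holds only markers or aborted records. `returned` models the
    * offsets `poll` handed back, sorted ascending. */
  def nextDataOffset(returned: Seq[Long], wanted: Long): Option[Long] =
    returned.find(_ >= wanted)

  def main(args: Array[String]): Unit = {
    // Offsets 5-6 were consumed by transaction markers, so poll skips them.
    val returned = Seq(3L, 4L, 7L)
    // Gap in the middle of a batch: asking for 5 must land on 7.
    assert(nextDataOffset(returned, 5L) == Some(7L))
    // Tail of the batch has no data message: the reader must not block.
    assert(nextDataOffset(returned, 8L).isEmpty)
  }
}
```

    A reader that assumed offset N+1 always follows offset N would hang or
miscount records in these cases, which is why each corner case needs explicit
handling.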
    
    ## How was this patch tested?
    
    The new unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark kafka-transaction-read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22042.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22042
    
----
commit dc18a6ff59fe7c48ed188a4eb9a6abf04caee0bd
Author: Shixiong Zhu <zsxwing@...>
Date:   2018-08-08T17:40:37Z

    Support non-consecutive offsets for Kafka

----

