GitHub user zsxwing opened a pull request:
https://github.com/apache/spark/pull/22042
[SPARK-25005][SS]Support non-consecutive offsets for Kafka
## What changes were proposed in this pull request?
When a user writes data with Kafka transactions, the offsets in Kafka
will be non-consecutive: the log will contain transaction (commit or abort)
markers. In addition, if the consumer's `isolation.level` is `read_committed`,
`poll` will not return aborted messages either. Hence, we will see
non-consecutive offsets in the data returned by `poll`. However, as `seekToEnd`
may move the offset to one of these missing offsets, there are 4 possible corner
cases we need to support:
- The whole batch contains no data messages
- The first offset in a batch is not a committed data message
- The last offset in a batch is not a committed data message
- There is a gap in the middle of a batch
They are all covered by the new unit tests.
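The four corner cases can be sketched with a toy model (plain Python, not Spark or Kafka client code; the log layout and the `visible_offsets` helper are illustrative only). It mimics what a `read_committed` consumer observes: markers and aborted messages occupy offsets but are never returned by `poll`, so the visible offsets have gaps.

```python
# Toy model of a Kafka partition log written with transactions.
# Each entry is (offset, kind): "marker" = commit/abort marker,
# "aborted" = a message from an aborted transaction. With
# isolation.level=read_committed, poll() surfaces only "data" entries,
# so the offsets a reader sees are non-consecutive.
LOG = [
    (0, "data"), (1, "data"),
    (2, "marker"),            # a batch may start on a non-data offset
    (3, "data"),
    (4, "aborted"),           # gap in the middle of a batch
    (5, "data"),
    (6, "marker"),            # a batch may end on a non-data offset
]

def visible_offsets(log, start, end):
    """Offsets of committed data messages in [start, end),
    i.e. what a read_committed poll() would return."""
    return [o for o, kind in log if start <= o < end and kind == "data"]

# Batch [2, 7): first and last offsets are not data, gap at 4.
print(visible_offsets(LOG, 2, 7))   # [3, 5]
# Batch [6, 7): contains no data messages at all.
print(visible_offsets(LOG, 6, 7))   # []
```

A reader that assumes offsets advance by exactly one per message (or that the batch end offset points at a data message, as `seekToEnd` may suggest) breaks on every one of these cases, which is why each has a dedicated unit test.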
## How was this patch tested?
The new unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zsxwing/spark kafka-transaction-read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22042.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22042
----
commit dc18a6ff59fe7c48ed188a4eb9a6abf04caee0bd
Author: Shixiong Zhu <zsxwing@...>
Date: 2018-08-08T17:40:37Z
Support non-consecutive offsets for Kafka
----