Github user zsxwing commented on a diff in the pull request:
https://github.com/apache/spark/pull/20253#discussion_r161349767
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
---
@@ -192,11 +194,26 @@ class KafkaContinuousDataReader(
   override def next(): Boolean = {
     var r: ConsumerRecord[Array[Byte], Array[Byte]] = null
     while (r == null) {
-      r = consumer.get(
-        nextKafkaOffset,
-        untilOffset = Long.MaxValue,
-        pollTimeoutMs = Long.MaxValue,
-        failOnDataLoss)
+      if (TaskContext.get().isInterrupted() || TaskContext.get().isCompleted()) return false
+      // Our consumer.get is not interruptible, so we have to set a low poll timeout, leaving
+      // interrupt points to end the query rather than waiting for new data that might never come.
+      try {
+        r = consumer.get(
+          nextKafkaOffset,
+          untilOffset = Long.MaxValue,
+          pollTimeoutMs = 1000,
+          failOnDataLoss)
+      } catch {
+        // We didn't read within the timeout. We're supposed to block indefinitely for new data, so
+        // swallow and ignore this.
+        case _: TimeoutException =>
+        // Data loss is reported on both sides - but we expect scenarios where the offset we're
+        // looking for isn't available yet. Don't propagate the exception unless it's because
+        // the offset we're looking for is below the available range.
+        case e: IllegalStateException
+            if e.getCause.isInstanceOf[OffsetOutOfRangeException] &&
+              !(consumer.getAvailableOffsetRange().earliest > nextKafkaOffset) =>
--- End diff ---
I would change this condition to `this.getAvailableOffsetRange().latest ==
nextKafkaOffset` for safety.
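
A minimal sketch of the distinction the suggestion draws (this is not Spark code; `OffsetRange` and `GuardDemo` are hypothetical stand-ins for `consumer.getAvailableOffsetRange()` and the guard logic). The guard in the diff swallows the exception whenever the requested offset is not below the available range, while the suggested guard swallows it only when the reader is exactly at the end of the range, i.e. genuinely waiting for new data:

```scala
// Hypothetical stand-in for the offset range returned by the Kafka consumer.
case class OffsetRange(earliest: Long, latest: Long)

object GuardDemo {
  // Guard as written in the diff: swallow unless the offset is below the range.
  def originalGuard(range: OffsetRange, nextKafkaOffset: Long): Boolean =
    !(range.earliest > nextKafkaOffset)

  // Suggested stricter guard: swallow only when waiting at the tail of the range.
  def stricterGuard(range: OffsetRange, nextKafkaOffset: Long): Boolean =
    range.latest == nextKafkaOffset

  def main(args: Array[String]): Unit = {
    val range = OffsetRange(earliest = 10, latest = 20)
    // Waiting at the tail: both guards swallow the exception.
    println(originalGuard(range, 20)) // true
    println(stricterGuard(range, 20)) // true
    // Offset inside the range: the original guard still swallows, but the
    // stricter guard propagates, surfacing an unexpected consumer state.
    println(originalGuard(range, 15)) // true
    println(stricterGuard(range, 15)) // false
  }
}
```

Under the stricter condition, any `OffsetOutOfRangeException` raised while the requested offset is anywhere other than the exact end of the available range is propagated rather than silently ignored, which is the safety gain the comment is after.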
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]