tomasbartalos commented on a change in pull request #23749: [SPARK-26841][SQL] Kafka timestamp pushdown
URL: https://github.com/apache/spark/pull/23749#discussion_r266477078
##########
File path:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaRelation.scala
##########
@@ -90,10 +94,12 @@ private[kafka010] class KafkaRelation(
// Calculate offset ranges
val offsetRanges = untilPartitionOffsets.keySet.map { tp =>
val fromOffset = fromPartitionOffsets.getOrElse(tp,
- // This should not happen since topicPartitions contains all partitions not in
- // fromPartitionOffsets
- throw new IllegalStateException(s"$tp doesn't have a from offset"))
- val untilOffset = untilPartitionOffsets(tp)
+ // This should not happen since topicPartitions contains all partitions not in
+ // fromPartitionOffsets
+ throw new IllegalStateException(s"$tp doesn't have a from offset")
+ }
+ var untilOffset = untilPartitionOffsets(tp)
+ untilOffset = if (areOffsetsInLine(fromOffset, untilOffset)) untilOffset else fromOffset
Review comment:
The nonsensical ranges originate from contradictory user queries
(example: `timestamp > 10 and timestamp < 10`). The question is how you want to
react to this kind of query.
1. If we don't handle them, the user will see an error:
`You either provided an invalid fromOffset, or the Kafka topic has been damaged`
2. If we do handle them, the user will get an empty result set and no error.
I prefer option 2, since this is how most databases would react, but if
you disagree I can delete the handling.
Maybe the method name `areOffsetsInLine` could be improved?
If I comment out the line
`untilOffset = if (areOffsetsInLine(fromOffset, untilOffset)) untilOffset else fromOffset`
then 2 unit tests fail:
- `timestamp pushdown with contradictory condition` - a query like `timestamp > 10 and timestamp < 10`
- `timestamp pushdown out of offset range` - this covers cases where the DS
option specifies an offset range and the timestamp filter is valid but falls
outside that range.
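
To make option 2 concrete, here is a minimal standalone sketch of the clamping. The body of `areOffsetsInLine` is not shown in this hunk, so the check below (`fromOffset <= untilOffset`) is an assumption, and `OffsetRangeSketch` and `clampUntilOffset` are hypothetical names used only for illustration:

```scala
// Sketch only: not the code from the PR.
object OffsetRangeSketch {
  // Assumed meaning of areOffsetsInLine: the range is not inverted.
  def areOffsetsInLine(fromOffset: Long, untilOffset: Long): Boolean =
    fromOffset <= untilOffset

  // Hypothetical helper: collapse an inverted range to an empty one,
  // so the read returns no records instead of raising an error.
  def clampUntilOffset(fromOffset: Long, untilOffset: Long): Long =
    if (areOffsetsInLine(fromOffset, untilOffset)) untilOffset else fromOffset

  def main(args: Array[String]): Unit = {
    // A valid range is left untouched.
    println(clampUntilOffset(5L, 20L))  // 20
    // An inverted range (from > until) collapses to [30, 30), i.e. empty.
    println(clampUntilOffset(30L, 10L)) // 30
  }
}
```

Under that assumption, both failing tests reduce to the second case: the pushed-down filter yields a from offset greater than the until offset, and the range is emptied rather than rejected.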