[
https://issues.apache.org/jira/browse/SPARK-41942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-41942:
---------------------------------
Component/s: Structured Streaming
(was: Java API)
> Current microbatch model is insufficient to efficiently handle partitioned
> logs with offsets
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-41942
> URL: https://issues.apache.org/jira/browse/SPARK-41942
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.2.3
> Reporter: Daniel Collins
> Priority: Major
>
> Spark microbatch only allows unidirectional flow of information from the
> driver to the worker, which has multiple negative effects for partition and
> offset based sources.
> If you have large gaps in the offsets between the current offset and the next
> message (such as when there is garbage collection, compaction or a
> newly-created subscription with a 0 offset seek point), the driver will need
> to work through all of the empty ranges, despite the worker knowing each time
> that the next message it will receive is at a much later offset.
> In addition, because the driver can only specify a range of work in terms of
> offsets, the size of the range selected must be highly tuned to the size of
> messages in the backlog- when the size changes or is variable, there can be
> no correct selection that both allows high throughput and doesn't overload
> the driver when messages are large.
> To resolve this, there needs to be a channel to communicate information about
> what messages were processed back to the driver code. This would allow simply
> resolving both of the above issues by 1) sending the offset of the last read
> message back to the driver and 2) structuring reads as <start offset, head
> offset, byte limit> instead of <start offset, end offset>, which would allow
> repeatable reads that also can respond to messages of different sizes by
> making different sized batches, and would allow skipping large offset gaps.
> This would have the added advantage of allowing non-offset based systems
> (such as google Pub/Sub) to implement the microbatch API, by propagating
> information about which messages were read back to the driver code.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]