CheneyYin commented on PR #7476:
URL: https://github.com/apache/seatunnel/pull/7476#issuecomment-2315208883

   > > 
https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/micro/SeaTunnelMicroBatchPartitionReader.java#L27-L49
   > > 
   > > 
https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/batch/ParallelBatchPartitionReader.java#L87-L97
   > > 
   > > PartitionReader never close in streaming mode.
   > 
   > hi @CheneyYin It seems that after a checkpoint, it will be close
   
   Yes. If the reader does not receive new data for a long time, Spark will end 
the current micro batch. Spark's micro-batch mechanism does not fully meet the 
requirements of long-term streaming computation. First, creating a new reader for 
each batch incurs some overhead. Second, the granularity of fault 
recovery is too coarse: the Spark micro-batch mechanism cannot restore the 
reader from the latest snapshot of the SeaTunnel reader.
   I am looking for strategies to alleviate these problems while still ensuring fault 
recovery. Currently, I add metadata to the SeaTunnel row and use a special 
identifier to mark checkpoint events. After the source completes a 
checkpoint, it creates a checkpoint record and sends it downstream. 
On receiving the checkpoint record, the sink saves its snapshot and confirms 
the checkpoint prepared by the source. These checkpoint operations are 
backed by a file-system directory space.
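   To make the idea above concrete, here is a minimal sketch of embedding a checkpoint marker as a special record in the row stream. All names (`Row`, `emitWithCheckpoint`, `consume`, the `isCheckpoint` flag) are hypothetical illustrations of the approach, not the actual SeaTunnel API:

```java
// Hypothetical sketch only: a data channel carries either normal rows or
// checkpoint-marker rows identified by a metadata flag. The sink snapshots
// its state when it sees a marker and confirms the checkpoint id back.
import java.util.ArrayDeque;
import java.util.Queue;

public class CheckpointBarrierSketch {

    // A row is either normal data or a checkpoint marker carrying an id.
    static final class Row {
        final boolean isCheckpoint; // metadata flag distinguishing record types
        final long checkpointId;    // valid only when isCheckpoint is true
        final String payload;       // valid only for data rows

        private Row(boolean isCheckpoint, long checkpointId, String payload) {
            this.isCheckpoint = isCheckpoint;
            this.checkpointId = checkpointId;
            this.payload = payload;
        }
        static Row data(String payload) { return new Row(false, -1L, payload); }
        static Row checkpoint(long id)  { return new Row(true, id, null); }
    }

    // Source side: after preparing checkpoint N, emit a marker row downstream.
    static void emitWithCheckpoint(Queue<Row> channel, long checkpointId) {
        channel.add(Row.data("record-1"));
        channel.add(Row.data("record-2"));
        channel.add(Row.checkpoint(checkpointId)); // marker follows the data
    }

    // Sink side: on a marker row, snapshot state and confirm the checkpoint.
    static long consume(Queue<Row> channel) {
        long confirmed = -1L;
        Row row;
        while ((row = channel.poll()) != null) {
            if (row.isCheckpoint) {
                // here the real sink would persist its snapshot (e.g. into a
                // file-system directory) before confirming
                confirmed = row.checkpointId;
            }
            // else: process the data row normally
        }
        return confirmed;
    }

    public static void main(String[] args) {
        Queue<Row> channel = new ArrayDeque<>();
        emitWithCheckpoint(channel, 42L);
        System.out.println("confirmed checkpoint: " + consume(channel));
    }
}
```

   The in-memory queue stands in for the Spark partition data flow; in the real design the marker would travel through the micro-batch rows themselves, and the sink's confirmation would be written under the shared file-system directory rather than returned in-process.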


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to