Shixiong Zhu created SPARK-28650:
------------------------------------

             Summary: Fix the guarantee of ForeachWriter
                 Key: SPARK-28650
                 URL: https://issues.apache.org/jira/browse/SPARK-28650
             Project: Spark
          Issue Type: Documentation
          Components: Structured Streaming
    Affects Versions: 2.4.3
            Reporter: Shixiong Zhu
Right now ForeachWriter has the following guarantee:

{code}
If the streaming query is being executed in the micro-batch mode, then every partition represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data. Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data and achieve exactly-once guarantees.
{code}

However, this guarantee can easily be broken when a query is restarted and a batch is re-run (e.g., after upgrading Spark):

* The source may return a DataFrame with a different number of partitions (e.g., Kafka Source V2 may stop creating empty partitions).
* A newly added optimization rule may change the number of partitions in the new run.
* The file split size may change in the new run.

Since we cannot guarantee that the same (partitionId, epochId) always carries the same data, we should update the documentation for "ForeachWriter".

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
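To make the issue concrete, here is a minimal sketch of the deduplication pattern the current documentation suggests. It is self-contained (no Spark dependency): the `ForeachWriterLike` trait and `Sink` object are hypothetical stand-ins for Spark's `ForeachWriter` and an external transactional store, so the example runs on its own. The key point is in the comment on `open`: skipping a previously committed (partitionId, epochId) is only safe if that pair always maps to the same data, which this issue shows is not guaranteed across restarts.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's ForeachWriter[T] API shape.
trait ForeachWriterLike[T] {
  def open(partitionId: Long, epochId: Long): Boolean
  def process(value: T): Unit
  def close(errorOrNull: Throwable): Unit
}

// Simulated external sink that remembers which (partitionId, epochId)
// pairs were already committed, e.g. a transactional store.
object Sink {
  val committed = mutable.Set[(Long, Long)]()
  val rows = mutable.ListBuffer[String]()
}

class DedupWriter extends ForeachWriterLike[String] {
  private var key: (Long, Long) = (-1L, -1L)
  private val buffer = mutable.ListBuffer[String]()

  // Returning false skips a batch that was already committed.
  // NOTE: per this issue, the dedup is NOT safe across query restarts,
  // because the same epoch may be re-planned with different partitions
  // and a skipped partition may now contain different data.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    key = (partitionId, epochId)
    !Sink.committed.contains(key)
  }

  override def process(value: String): Unit = buffer += value

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      Sink.rows ++= buffer      // a real sink would commit transactionally
      Sink.committed += key
    }
  }
}

// Simulate the same (partitionId = 0, epochId = 5) being run twice,
// as happens when a failed micro-batch is retried.
val w1 = new DedupWriter
if (w1.open(0, 5)) { w1.process("a"); w1.process("b"); w1.close(null) }
val w2 = new DedupWriter
if (w2.open(0, 5)) { w2.process("a"); w2.process("b"); w2.close(null) }
println(Sink.rows.toList) // the re-run is skipped; rows are written once
```

Within a single run the retry is correctly deduplicated; the problem described above arises only when the query is restarted and the same epoch is re-planned with a different partitioning.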