[GitHub] [spark] sunchao commented on a change in pull request #31089: [MINOR][SS] Add some description about auto reset and data loss note to SS doc

GitBox Thu, 07 Jan 2021 23:17:41 -0800


sunchao commented on a change in pull request #31089:
URL: https://github.com/apache/spark/pull/31089#discussion_r553778269




##########
File path: docs/structured-streaming-kafka-integration.md
##########
@@ -878,7 +878,14 @@ group id, however, please read warnings for this option and use it with caution.
  where to start instead. Structured Streaming manages which offsets are consumed internally, rather
  than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
  topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
- streaming query is started, and that resuming will always pick up from where the query left off.
+ streaming query is started, and that resuming will always pick up from where the query left off. Note
+ that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
+ offsets are out of range, or offsets are removed after offset retention period), the offsets

Review comment:
       "offset retention period": not sure whether "offset" is redundant here.
   
   Also, perhaps "the offsets are not reset" -> "they will not be reset".

##########
File path: docs/structured-streaming-kafka-integration.md
##########
@@ -878,7 +878,14 @@ group id, however, please read warnings for this option and use it with caution.
  where to start instead. Structured Streaming manages which offsets are consumed internally, rather
  than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
  topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
- streaming query is started, and that resuming will always pick up from where the query left off.
+ streaming query is started, and that resuming will always pick up from where the query left off. Note
+ that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
+ offsets are out of range, or offsets are removed after offset retention period), the offsets
+ are not reset and the streaming application will see data lost. In extreme cases, for example the

Review comment:
       "see data lost" -> "see data loss"

##########
File path: docs/structured-streaming-kafka-integration.md
##########
@@ -878,7 +878,14 @@ group id, however, please read warnings for this option and use it with caution.
  where to start instead. Structured Streaming manages which offsets are consumed internally, rather
  than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
  topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
- streaming query is started, and that resuming will always pick up from where the query left off.
+ streaming query is started, and that resuming will always pick up from where the query left off. Note
+ that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,

Review comment:
       "is not in" -> "are not in"







