GitHub user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26088133
--- Diff: docs/streaming-programming-guide.md ---
@@ -1933,17 +2057,24 @@ The following table summarizes the semantics under failures:
</tr>
</table>
+### With Kafka Direct API
+{:.no_toc}
+In Spark 1.3, we have introduced a new Kafka Direct API, which can ensure that all the Kafka data is received by Spark Streaming exactly once. Along with this, if you implement exactly-once output operation, you can achieve end-to-end exactly-once guarantees. This approach (experimental as of Spark 1.3) is further discussed in the [Kafka Integration Guide](stream-kafka-integration.html).
+
## Semantics of output operations
{:.no_toc}
-Since all data is modeled as RDDs with their lineage of deterministic operations, any recomputation
- always leads to the same result. As a result, all DStream transformations are guaranteed to have
- _exactly-once_ semantics. That is, the final transformed result will be same even if there were
- was a worker node failure. However, output operations (like `foreachRDD`) have _at-least once_
- semantics, that is, the transformed data may get written to an external entity more than once in
- the event of a worker failure. While this is acceptable for saving to HDFS using the
- `saveAs***Files` operations (as the file will simply get over-written by the same data),
- additional transactions-like mechanisms may be necessary to achieve exactly-once semantics
- for output operations.
+Output operations (like `foreachRDD`) have _at-least once_ semantics, that is,
+the transformed data may get written to an external entity more than once in
+the event of a worker failure. While this is acceptable for saving to file systems using the
+`saveAs***Files` operations (as the file will simply get over-written with the same data),
--- End diff --
Nit: "overwritten" doesn't need to be hyphenated ("over-written" in the added line above).