Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3653#discussion_r21652896
--- Diff: docs/streaming-custom-receivers.md ---
@@ -191,9 +196,68 @@ The full source code is in the example
[JavaCustomReceiver.java](https://github.
</div>
</div>
-
-
-### Implementing and Using a Custom Actor-based Receiver
+## Receiver Reliability
+As discussed in brief in the
+[Spark Streaming Programming
Guide](streaming-programming-guide.html#receiver-reliability),
+there are two kinds of receivers based on their reliability and
fault-tolerance semantics.
+
+1. *Reliable Receiver* - For *reliable sources* that allow sent data to be
acknowledged, a
+ *reliable receiver* correctly acknowledges to the source that the data
has been received
+ and stored in Spark reliably (that is, replicated successfully). Usually,
+ implementing this receiver involves careful consideration of the
semantics of source
+ acknowledgements.
+1. *Unreliable Receiver* - These are receivers for unreliable sources that
do not support
+ acknowledging. Even for reliable sources, one may implement an
unreliable receiver that
+ does not go into the complexity of acknowledging correctly.
+
+To implement a *reliable receiver*, you have to use
`store(multiple-records)` to store data.
+This flavour of `store` is a blocking call which returns only after all
the given records have
+been stored inside Spark. If the receiver's configured storage level uses
replication
+(enabled by default), then this call returns after replication has
completed.
+Thus it ensures that the data is reliably stored, and the receiver can now
acknowledge the
+source appropriately. This ensures that no data is lost when the
receiver fails in the middle
+of replicating data -- the buffered data will not be acknowledged and
hence will be later resent
+by the source.
+
+An *unreliable receiver* does not have to implement any of this logic. It
can simply receive
+records from the source and insert them one-at-a-time using
`store(single-record)`. While it does
+not get the reliability guarantees of `store(multiple-records)`, it has
the following advantages.
+
+- The system takes care of chunking the data into appropriately sized
blocks (look for block
+interval in the [Spark Streaming Programming
Guide](streaming-programming-guide.html)).
+- The system takes care of controlling the receiving rates if the rate
limits have been specified.
+- Because of these two, *unreliable receivers are simpler to implement
than reliable receivers*.
+
+The following table summarizes the characteristics of both types of
receivers:
+
+<table class="table">
+<tr>
+ <th>Receiver Type</th>
+ <th>Characteristics</th>
+</tr>
+<tr>
+ <td><b>Unreliable Receivers</b></td>
+ <td>
+ Simple to implement.<br>
+ System takes care of block generation and rate control.
+ No fault-tolerance guarantees, can loose data on receiver failure.
--- End diff --
loose -> lose
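
For readers following the reliability discussion in the diff above, here is a minimal sketch of the store-then-acknowledge ordering it describes. It deliberately does not use Spark's `Receiver` API: `AckingSource`, `BlockStore`, `storeAll`, and `ReliableReceiverSketch` are hypothetical stand-ins invented for illustration, with `storeAll` playing the role of the blocking `store(multiple-records)` call. The point is only that the acknowledgement happens after the store call returns, so a failure in between leaves the batch unacknowledged and available for resending.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical reliable source: records stay pending until they are acknowledged.
class AckingSource {
    private final Deque<String> pending = new ArrayDeque<>();

    AckingSource(List<String> records) {
        pending.addAll(records);
    }

    // Returns up to max pending records, oldest first, without removing them.
    List<String> peek(int max) {
        List<String> batch = new ArrayList<>();
        Iterator<String> it = pending.iterator();
        while (it.hasNext() && batch.size() < max) {
            batch.add(it.next());
        }
        return batch;
    }

    // Acknowledges the oldest count records so they are never resent.
    void ack(int count) {
        for (int i = 0; i < count && !pending.isEmpty(); i++) {
            pending.removeFirst();
        }
    }

    int pendingCount() {
        return pending.size();
    }
}

// Hypothetical stand-in for the blocking multi-record store call.
class BlockStore {
    final List<String> stored = new ArrayList<>();
    boolean failNext = false; // set to simulate a failure during replication

    // Returns only after the whole batch is stored; throws if storing fails.
    void storeAll(List<String> batch) {
        if (failNext) {
            failNext = false;
            throw new IllegalStateException("simulated failure while replicating");
        }
        stored.addAll(batch);
    }
}

class ReliableReceiverSketch {
    // Stores one batch and acknowledges it only after the store call returns.
    // Returns true if a batch was stored and acknowledged.
    static boolean receiveOnce(AckingSource source, BlockStore store, int batchSize) {
        List<String> batch = source.peek(batchSize);
        if (batch.isEmpty()) {
            return false;
        }
        try {
            store.storeAll(batch);
        } catch (IllegalStateException e) {
            return false; // nothing acknowledged, so the source still holds the batch
        }
        source.ack(batch.size());
        return true;
    }
}
```

An unreliable receiver, by contrast, would skip the acknowledgement step entirely and hand records over one at a time, which is why it is simpler but can lose buffered data on failure.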