Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/6863#discussion_r32788398
--- Diff: docs/streaming-kafka-integration.md ---
@@ -161,6 +161,8 @@ Next, we discuss how to use this approach in your
streaming application.
You can use this to update Zookeeper yourself if you want
Zookeeper-based Kafka monitoring tools to show progress of the streaming
application.
- Another thing to note is that since this approach does not use
Receivers, the standard receiver-related
[configurations](configuration.html) (that is, configurations of the form
`spark.streaming.receiver.*`) will not apply to the input DStreams created
by this approach (they will still apply to other input DStreams). Instead,
use the `spark.streaming.kafka.*` [configurations](configuration.html). An
important one is `spark.streaming.kafka.maxRatePerPartition`, which is the
maximum rate at which each Kafka partition will be read by this direct
API; a sketch of setting it is shown below.
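For example, a minimal sketch of capping the per-partition read rate when
constructing the streaming context (the application name and the value of
1000 records per second per partition are arbitrary placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap the rate at which the direct API reads each Kafka partition.
// 1000 records/second/partition is only an example value; tune it for
// your workload.
val conf = new SparkConf()
  .setAppName("DirectKafkaExample")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(2))
```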
+ Note that the typecast to `HasOffsetRanges` will only succeed if it is
done in the first method called on `directKafkaStream`, not later down a
chain of methods. You can use `transform()` instead of `foreachRDD()` as
your first method call in order to access offsets, then call further Spark
methods (see the sketch below). However, be aware that the one-to-one
mapping between RDD partition and Kafka partition does not remain after
any method that shuffles or repartitions, e.g. `reduceByKey()` or
`window()`.
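For example, a minimal sketch of this pattern, assuming `directKafkaStream`
was created as shown earlier on this page (the `map()` body is a
placeholder for real per-record processing):

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hold a reference to the current offset ranges so they can be used
// downstream.
var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  // transform() is the first method called on the stream, so the cast
  // to HasOffsetRanges succeeds here.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map { record =>
  record  // placeholder for per-record processing
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
```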
--- End diff --
Good addition. :)