Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/7185#issuecomment-118153576
I agree with most of @koeninger's points. I think the issue is that the python KafkaDirectStream generates python RDDs that are effectively
[a python map on the JavaPairRDD wrapping the KafkaRDD].
How about this.
* Create a KafkaRDD class in python which is a subclass of python's RDD
class. It adds the method getOffsetRanges(), which finds the offset ranges
by traversing from the associated JavaPairRDD down to the Scala KafkaRDD inside it.
* Create a KafkaDirectDStream in python which is a subclass of python's
DStream, and which overrides the transform and foreachRDD methods to apply the
user-specified function to python KafkaRDD objects instead of RDD objects.
This ensures that the user will be able to call `rdd.getOffsetRanges()`
inside the transform and foreach functions.
With this, I think it is doable to maintain consistency with the Scala API,
that is, `rdd.getOffsetRanges()` in python gives the offset ranges.
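To illustrate the shape of the two subclasses, here is a minimal sketch using plain stand-in classes rather than the real pyspark `RDD`/`DStream` (all class names, the `OffsetRange` fields, and the way the Java-side object is reached are hypothetical, just mirroring the proposal above):

```python
from collections import namedtuple

# Hypothetical offset-range record; mirrors Scala's OffsetRange fields.
OffsetRange = namedtuple("OffsetRange", ["topic", "partition", "fromOffset", "untilOffset"])

class RDD:
    """Stand-in for pyspark's RDD, which wraps a Java-side RDD."""
    def __init__(self, jrdd):
        self._jrdd = jrdd

class KafkaRDD(RDD):
    """RDD subclass that exposes the offsets of the underlying KafkaRDD."""
    def getOffsetRanges(self):
        # In real code this would traverse from the wrapped JavaPairRDD
        # to the Scala KafkaRDD and convert its OffsetRange objects.
        return [OffsetRange(o["topic"], o["partition"], o["fromOffset"], o["untilOffset"])
                for o in self._jrdd.offsetRanges()]

class DStream:
    """Stand-in for pyspark.streaming.DStream."""
    def __init__(self, jrdd_factory):
        self._jrdd_factory = jrdd_factory

class KafkaDirectDStream(DStream):
    """DStream subclass whose foreachRDD hands the user function a
    KafkaRDD (with getOffsetRanges) instead of a plain RDD."""
    def foreachRDD(self, func):
        func(KafkaRDD(self._jrdd_factory()))

# Fake Java-side RDD so the sketch runs standalone.
class FakeJavaPairRDD:
    def offsetRanges(self):
        return [{"topic": "topic1", "partition": 0, "fromOffset": 0, "untilOffset": 100}]

captured = []
stream = KafkaDirectDStream(FakeJavaPairRDD)
stream.foreachRDD(lambda rdd: captured.append(rdd.getOffsetRanges()))
```

The key point of the design is visible in `foreachRDD`: the user function never sees the plain RDD, only the `KafkaRDD` wrapper, so `rdd.getOffsetRanges()` works inside the callback just as in the Scala API.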