Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/7185#issuecomment-118153576
I agree with most of @koeninger's points. I think the issue is that the python KafkaDirectStream generates python RDDs that are effectively
[a python map on the JavaPairRDD wrapping the KafkaRDD].
How about this.
* Create a KafkaRDD class in python which is a subclass of python's RDD
class. It adds the method getOffsetRanges(), which finds the offset ranges
by traversing from the associated JavaPairRDD down to the Scala KafkaRDD inside it.
* Create a KafkaDirectDStream in python which is a subclass of python's
DStream, and which overrides the transform and foreachRDD methods to apply the
user-specified function to python KafkaRDD objects instead of RDD objects.
This ensures that the user will be able to call `rdd.getOffsetRanges()`
inside the transform and foreach functions.
With this, I think it is doable to maintain consistency with the Scala API,
that is, `rdd.getOffsetRanges()` in python gives the offset ranges.
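To illustrate the shape of the two subclasses, here is a minimal sketch using plain stand-in classes rather than the real pyspark `RDD`/`DStream` (all class names, the `OffsetRange` fields, and the way the Java-side object is reached are hypothetical, just mirroring the proposal above):

```python
from collections import namedtuple

# Hypothetical offset-range record; mirrors Scala's OffsetRange fields.
OffsetRange = namedtuple("OffsetRange", ["topic", "partition", "fromOffset", "untilOffset"])

class RDD:
    """Stand-in for pyspark's RDD, which wraps a Java-side RDD."""
    def __init__(self, jrdd):
        self._jrdd = jrdd

class KafkaRDD(RDD):
    """RDD subclass that exposes the offsets of the underlying KafkaRDD."""
    def getOffsetRanges(self):
        # In real code this would traverse from the wrapped JavaPairRDD
        # to the Scala KafkaRDD and convert its OffsetRange objects.
        return [OffsetRange(o["topic"], o["partition"], o["fromOffset"], o["untilOffset"])
                for o in self._jrdd.offsetRanges()]

class DStream:
    """Stand-in for pyspark.streaming.DStream."""
    def __init__(self, jrdd_factory):
        self._jrdd_factory = jrdd_factory

class KafkaDirectDStream(DStream):
    """DStream subclass whose foreachRDD hands the user function a
    KafkaRDD (with getOffsetRanges) instead of a plain RDD."""
    def foreachRDD(self, func):
        func(KafkaRDD(self._jrdd_factory()))

# Fake Java-side RDD so the sketch runs standalone.
class FakeJavaPairRDD:
    def offsetRanges(self):
        return [{"topic": "topic1", "partition": 0, "fromOffset": 0, "untilOffset": 100}]

captured = []
stream = KafkaDirectDStream(FakeJavaPairRDD)
stream.foreachRDD(lambda rdd: captured.append(rdd.getOffsetRanges()))
```

The key point of the design is visible in `foreachRDD`: the user function never sees the plain RDD, only the `KafkaRDD` wrapper, so `rdd.getOffsetRanges()` works inside the callback just as in the Scala API.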