GitHub user jerryshao opened a pull request:
https://github.com/apache/spark/pull/7185
[SPARK-8389][Streaming][PySpark] Expose KafkaRDDs offsetRange in Python
This PR proposes a simple way to expose OffsetRange in Python code. The
usage of offsetRanges is similar to the Scala/Java way; in Python we can get
the OffsetRange like this:
```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```
The reason I didn't follow the approach suggested in SPARK-8389 is that the
Python Kafka API has one more step than Scala/Java, to decode the message. As a
result, the Python API returns a transformed RDD/DStream rather than a directly
wrapped JavaKafkaRDD, so it is hard to backtrack to the original RDD to get
the offsetRange.
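To illustrate the backtracking problem, here is a toy sketch in plain Python. It does not use PySpark: `PlainRDD`, `KafkaBackedRDD`, `OffsetRange` and `utf8_decode` are hypothetical stand-ins invented for this example, and the topic name and offsets are made up. It only shows why an RDD produced by a decode step no longer carries the offset information that its parent had.

```python
# Toy model only: every class here is a hypothetical stand-in, not a PySpark API.
from collections import namedtuple

OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "from_offset", "until_offset"]
)


class PlainRDD:
    """Minimal RDD stand-in that holds records and supports map()."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        # A transformation yields a new PlainRDD with no link back to its parent.
        return PlainRDD(fn(r) for r in self.records)


class KafkaBackedRDD(PlainRDD):
    """Stand-in for the JVM-side Kafka RDD, which knows its offset ranges."""

    def __init__(self, records, offset_ranges):
        super().__init__(records)
        self.offset_ranges = list(offset_ranges)


def utf8_decode(b):
    return b.decode("utf-8")


# Raw (key, value) byte pairs plus the offsets they were read from (made-up values).
raw = KafkaBackedRDD(
    [(b"k1", b"v1"), (b"k2", b"v2")],
    [OffsetRange("events", 0, 100, 102)],
)

# In this model, the user is handed the decoded RDD rather than `raw`.
decoded = raw.map(lambda kv: (utf8_decode(kv[0]), utf8_decode(kv[1])))

print(hasattr(raw, "offset_ranges"))      # the Kafka-backed RDD has offsets
print(hasattr(decoded, "offset_ranges"))  # the decoded RDD does not
```

In this model the offsets are reachable only from `raw`, which is why a helper that looks up the ranges for a given RDD is easier to provide than an attribute on the RDD the user actually holds.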
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-8389
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7185.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7185
----
commit 848c708246a632aea19e3a992d28973b6d31df10
Author: jerryshao <[email protected]>
Date: 2015-06-25T02:28:51Z
Add HasOffsetRanges for Python
commit 2aabf9e1b7b8414aab3428cbbab772504b66078f
Author: jerryshao <[email protected]>
Date: 2015-07-02T07:58:09Z
Style fix
----