[
https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Micah Whitacre updated CRUNCH-606:
----------------------------------
Attachment: CRUNCH-606.patch
So the approach I'm going with the source is that callers are going to be
responsible for managing start/stop offsets. They are also responsible for
persisting that information somewhere. I think in a later request I'll work on
adding some convenience methods for that.
Posting progress here for comments or feedback. There is some work to be done,
specifically:
* More tests around some of the utilities in KafkaUtils.
* Finish flushing out the KafkaSourceIT (hitting a ClassCastException from
String to Text)
* Finish flushing out the KafkaSource.read(...) method because it stops us from
being able to call materialize on a read PCollection.
* Reogranize the tests because right now the "long running" (but not really)
tests are in surefire vs our convention of putting them in src/it/tests
One specific feedback is if I should keep with the path of making the
KafkaSource extend FileInputFormat or if I'm trying to put a square peg in a
round hole.
> Create a KafkaSource
> --------------------
>
> Key: CRUNCH-606
> URL: https://issues.apache.org/jira/browse/CRUNCH-606
> Project: Crunch
> Issue Type: New Feature
> Components: IO
> Reporter: Micah Whitacre
> Assignee: Micah Whitacre
> Attachments: CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case and some of the ways to do it
> Kafka Connect, Camus, Gobblin do not integrate nicely with existing
> processing pipelines like Crunch. With Kafka 0.9, the consuming API is a lot
> easier so we should build a Source implementation that can read from Kafka.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)