[ 
https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Whitacre updated CRUNCH-606:
----------------------------------
    Attachment: CRUNCH-606.patch

The approach I'm taking with the source is that callers are responsible for
managing start/stop offsets.  They are also responsible for persisting that
information somewhere.  In a later request I'll work on adding some
convenience methods for that.
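
As an illustration of the caller-managed start/stop offsets described above, here is a minimal sketch of what the caller-side bookkeeping could look like. It uses only the JDK. Every name in it (OffsetBookkeeping, OffsetRange, nextRanges, saveStops, loadStops) is hypothetical and does not come from the attached patch, and the half-open [start, stop) convention and the properties-file storage are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical caller-side helper; not part of CRUNCH-606.patch. */
final class OffsetBookkeeping {

    /** Assumed convention: read offsets in the half-open range [start, stop). */
    static final class OffsetRange {
        final long start;
        final long stop;

        OffsetRange(long start, long stop) {
            if (start < 0 || stop < start) {
                throw new IllegalArgumentException("bad range: " + start + ".." + stop);
            }
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Builds the ranges for the next run. Each partition starts where the
     * previous run stopped (0 if it was never read) and stops at the latest
     * offset the caller observed for it.
     */
    static Map<String, OffsetRange> nextRanges(Map<String, Long> lastStops,
                                               Map<String, Long> latestOffsets) {
        Map<String, OffsetRange> ranges = new TreeMap<>();
        for (Map.Entry<String, Long> e : latestOffsets.entrySet()) {
            long start = lastStops.getOrDefault(e.getKey(), 0L);
            ranges.put(e.getKey(), new OffsetRange(start, e.getValue()));
        }
        return ranges;
    }

    /** Persists the stop offset of each range so the next run can resume from it. */
    static void saveStops(Path file, Map<String, OffsetRange> ranges) {
        Properties props = new Properties();
        for (Map.Entry<String, OffsetRange> e : ranges.entrySet()) {
            props.setProperty(e.getKey(), Long.toString(e.getValue().stop));
        }
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "stop offset per partition");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Loads previously saved stop offsets; returns an empty map if none exist. */
    static Map<String, Long> loadStops(Path file) {
        Map<String, Long> stops = new TreeMap<>();
        if (!Files.exists(file)) {
            return stops;
        }
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        for (String key : props.stringPropertyNames()) {
            stops.put(key, Long.parseLong(props.getProperty(key)));
        }
        return stops;
    }
}
```

In this sketch a caller would load the saved stops, compute the next ranges, hand them to the source, and save the new stops only after the pipeline run succeeds, so a failed run is simply retried over the same range.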

Posting progress here for comments or feedback.  There is some work to be done, 
specifically:
* More tests around some of the utilities in KafkaUtils.
* Finish fleshing out the KafkaSourceIT (hitting a ClassCastException from
String to Text)
* Finish fleshing out the KafkaSource.read(...) method, because its current
state stops us from being able to call materialize on a read PCollection.
* Reorganize the tests, because right now the "long running" (but not really)
tests run under surefire instead of following our convention of putting them
in src/it/tests

One specific point I'd like feedback on is whether I should keep going down
the path of making the KafkaSource extend FileInputFormat, or whether I'm
trying to put a square peg in a round hole.

> Create a KafkaSource
> --------------------
>
>                 Key: CRUNCH-606
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-606
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case, and some of the ways to do it
> (Kafka Connect, Camus, Gobblin) do not integrate nicely with existing
> processing pipelines like Crunch.  With Kafka 0.9, the consuming API is a lot
> easier to use, so we should build a Source implementation that can read from
> Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
