Hey Anh,

You might also want to have a look at:

  https://issues.apache.org/jira/browse/SAMZA-138

This is a basic implementation of a SystemConsumer that reads messages
from a file, line by line, and feeds them to Samza.
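
In case it helps as orientation, the rough shape of such a consumer, built
on the BlockingEnvelopeMap helper, would be something like this (class and
field names below are illustrative, not the actual patch):

  import java.io.BufferedReader;
  import java.io.FileReader;

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.SystemStreamPartition;
  import org.apache.samza.util.BlockingEnvelopeMap;

  public class FileReaderSystemConsumer extends BlockingEnvelopeMap {
    private final String path;
    private volatile SystemStreamPartition ssp;
    private Thread reader;

    public FileReaderSystemConsumer(String path) {
      this.path = path;
    }

    @Override
    public void register(SystemStreamPartition systemStreamPartition, String offset) {
      super.register(systemStreamPartition, offset);
      // Single partition in this sketch: the whole file maps to one SSP.
      this.ssp = systemStreamPartition;
    }

    @Override
    public void start() {
      reader = new Thread(new Runnable() {
        public void run() {
          try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            long offset = 0;
            while ((line = in.readLine()) != null) {
              // Each line becomes one message; Samza picks it up on its next poll().
              put(ssp, new IncomingMessageEnvelope(ssp, String.valueOf(offset++), null, line));
            }
            setIsAtHead(ssp, true); // no more messages left in the file
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        }
      });
      reader.start();
    }

    @Override
    public void stop() {
      reader.interrupt();
    }
  }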

Cheers,
Chris

On 2/27/14 7:50 AM, "Garry Turkington" <[email protected]>
wrote:

>Hi Casey,
>
>Yep, your explanation is correct.
>
>Even though Samza is currently very much associated with Kafka, the actual
>model is very system-neutral. If you look at the SystemConsumer interface,
>it has a poll method that returns a List<IncomingMessageEnvelope>. The
>Samza framework calls that method on the concrete consumer implementation,
>which in this case is going to be retrieving messages from either a Kafka
>topic or the Wikipedia feed. Then, when the task writes messages out via
>the collector.send method, they are passed through the similarly named
>method in SystemProducer.
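>
>For reference, the two interfaces look roughly like this (simplified; the
>exact signatures are in the Samza source, so treat this as a sketch):
>
>  public interface SystemConsumer {
>    void start();
>    void stop();
>    void register(SystemStreamPartition ssp, String startingOffset);
>    List<IncomingMessageEnvelope> poll(
>        Map<SystemStreamPartition, Integer> maxMessagesPerStream, long timeout)
>        throws InterruptedException;
>  }
>
>  public interface SystemProducer {
>    void start();
>    void stop();
>    void register(String source);
>    void send(String source, OutgoingMessageEnvelope envelope);
>    void flush(String source);
>  }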
>
>In the task config file the class for the system factory is specified, and
>that's the mechanism used to generate the actual SystemConsumer and
>SystemProducer objects. So if you needed an actual Samza system to
>produce messages directly that are read by a task, you'd start by
>implementing the SystemFactory and SystemConsumer. If you just use an
>external client to push data straight into Kafka, then from Samza's
>perspective it is just reading and writing from the same (Kafka) system
>and has no visibility of the externals.
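>
>For example, hello-samza's wikipedia-feed config wires this up with
>something along these lines (from memory, so check the exact values in the
>project):
>
>  # which factory builds the consumer/producer for each named system
>  systems.wikipedia.samza.factory=samza.examples.wikipedia.system.WikipediaSystemFactory
>  systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
>  # the task's input stream(s), in system.stream form
>  task.inputs=wikipedia.#en.wikipedia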
>
>Hope that helps, I'm still getting my head around some of this stuff
>myself!
>Garry
>
>-----Original Message-----
>From: Anh Thu Vu [mailto:[email protected]]
>Sent: 27 February 2014 14:00
>To: [email protected]
>Subject: Re: Read from a file and write to a Kafka stream with Samza
>
>Hi Garry,
>
>First of all, thanks for the link.
>
>After skimming through the archive and reading your reply, I think I
>misunderstood what the streams in the wikipedia system are.
>
>So, in hello-samza, the WikipediaConsumer produces some wikipedia.*
>streams which are read by the Wikipedia feed task. The Wikipedia feed task
>then writes messages from those wikipedia.* streams to kafka.wikipedia-raw.
>
>So my first impression of the flow of the system was: (data file) =>
>(wikipedia.* streams) => (kafka.wikipedia-raw), where the wikipedia.*
>streams are something "physical", something like the Kafka streams.
>
>But now, after thinking more about it, I guess the system works as follows:
>The WikipediaConsumer listens to the IRC feed and queues the messages it
>receives in an in-memory queue (or something similar in memory).
>Samza periodically polls the WikipediaConsumer for messages on each of the
>wikipedia.* streams and sends the messages to the Wikipedia feed task.
>The Wikipedia feed task writes the messages to a Kafka stream.
>
>In a way, those wikipedia streams are just names that Samza uses to
>request messages from the WikipediaConsumer. How the messages are
>buffered/queued/stored before they are polled by Samza is up to the
>WikipediaConsumer.
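>
>(As far as I can tell from the source, the WikipediaConsumer extends the
>BlockingEnvelopeMap helper, so the IRC callback does little more than this
>-- simplified:
>
>  public void onEvent(WikipediaFeedEvent event) {
>    SystemStreamPartition ssp =
>        new SystemStreamPartition(systemName, event.getChannel(), new Partition(0));
>    try {
>      // Buffer the event in the in-memory queue; Samza drains it on poll().
>      put(ssp, new IncomingMessageEnvelope(ssp, null, null, event));
>    } catch (Exception e) {
>      throw new SamzaException(e);
>    }
>  }
>
>and the wikipedia.* names just identify which queue the message goes into.)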
>
>Then, if my understanding is correct, and according to Chris' reply in
>that thread, I have 2 options:
>1) Keep it as it is now and run the Wikipedia feed task as a Samza job
>2) Implement a Kafka producer that reads from the external source and
>writes to a Kafka stream/topic (see the sketch below). This would be purely
>an implementation against Kafka and would have nothing to do with Samza.
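>
>For option 2 I imagine something as simple as this standalone program would
>do (sketched against the 0.8 producer API; the broker address, topic name
>and file path are just examples):
>
>  import java.io.BufferedReader;
>  import java.io.FileReader;
>  import java.util.Properties;
>
>  import kafka.javaapi.producer.Producer;
>  import kafka.producer.KeyedMessage;
>  import kafka.producer.ProducerConfig;
>
>  public class FileToKafka {
>    public static void main(String[] args) throws Exception {
>      Properties props = new Properties();
>      props.put("metadata.broker.list", "localhost:9092");
>      props.put("serializer.class", "kafka.serializer.StringEncoder");
>      Producer<String, String> producer =
>          new Producer<String, String>(new ProducerConfig(props));
>
>      try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
>        String line;
>        while ((line = in.readLine()) != null) {
>          // One Kafka message per line; a Samza task then just reads the topic.
>          producer.send(new KeyedMessage<String, String>("wikipedia-raw", line));
>        }
>      }
>      producer.close();
>    }
>  }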
>
>Correct me if I'm wrong. And thank you so much for the reply.
>Casey
>
>
>On Thu, Feb 27, 2014 at 1:12 PM, Garry Turkington <
>[email protected]> wrote:
>
>> Hi Casey,
>>
>> Here's a link to a response from Chris on that thread which I think
>> gives the high-level picture pretty well:
>>
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201310.mb
>> ox/%3C1B43C7411DB20E47AB0FB62E7262B80179B2159F%40ESV4-MBX02.linkedin.b
>> iz%3E
>>
>> Basically it's a choice between building a system client within Samza
>> that pulls the external data or having an external process that pushes
>> the data into a Kafka topic that is then the input to a Samza task.
>> Which is most appropriate will depend on the nature of the data source,
>> plus other considerations such as whether the raw data will have other
>> consumers.
>>
>> Though I'm not quite sure I follow what you mean re your implemented
>> custom system? If you look at the source for the Wikipedia feed task,
>> it reads from the external source (Wikipedia) then writes the output
>> message directly to Kafka (the call to collector.send) -- isn't this
>> what you want?
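>>
>> The relevant bit of that task is roughly this (simplified from the
>> hello-samza source, so check the project for the real thing):
>>
>>   public class WikipediaFeedStreamTask implements StreamTask {
>>     private static final SystemStream OUTPUT = new SystemStream("kafka", "wikipedia-raw");
>>
>>     @Override
>>     public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
>>         TaskCoordinator coordinator) {
>>       // Whatever arrived from the wikipedia system goes straight out to Kafka.
>>       collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
>>     }
>>   }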
>>
>> Regards,
>> Garry
>>
>> -----Original Message-----
>> From: Anh Thu Vu [mailto:[email protected]]
>> Sent: 27 February 2014 11:32
>> To: [email protected]
>> Subject: Read from a file and write to a Kafka stream with Samza
>>
>> Hi everyone,
>>
>> As the subject says, I want to read from an external source (the
>> simplest case is to read from a file) and write to a Kafka stream. Then
>> another StreamTask will start reading from the Kafka stream.
>>
>> I've succeeded in running HelloSamza and have written a similar (VERY
>> similar) app that has a custom Consumer reading from a file and then
>> writing to a custom system (i.e. testsystem.mystream). Then I have a
>> StreamTask that reads from this custom stream and writes to a Kafka
>> stream. However, I want to bypass the custom stream and write the
>> messages from the external source directly to the Kafka stream.
>>
>> I guess I will have to implement the SystemFactory such that it returns
>> a Kafka producer from the getProducer() method, but I'm not very sure
>> how to do that yet. Although I very much welcome & appreciate all
>> guides/advice/suggestions, my main objective with this mail is to ask
>> for the link to the thread "Writing a simple KafkaProducer in Samza"
>> that was mentioned in
>>
>>
>> https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201311.m
>> box/%3CEA1B8C30F3B4C34EBA71F2AAF48F5990D612E028%40Mail3.impetus.co.in%
>> 3E
>>
>> Thank you very much,
>> Casey
>>
>
