Hi Casey,

Yep, your explanation is correct.
Even though Samza is currently very much associated with Kafka, the actual model is very system-neutral. If you look at the SystemConsumer interface, it has a poll method that returns a List<IncomingMessageEnvelope>. The Samza framework calls that method on the concrete consumer implementation, which in this case is either going to be retrieving messages from a Kafka topic or from the Wikipedia feed. Then, when the task writes messages out via the collector's send method, they are passed through the similarly named method in SystemProducer.

The task config file specifies the class for the system factory, and that's the mechanism used to generate the actual SystemConsumer and SystemProducer objects. So if you needed an actual Samza system to produce messages directly that are read by a task, you'd start by implementing the SystemFactory and SystemConsumer. If you just use an external client to push data straight into Kafka, then from Samza's perspective it is just reading and writing from the same (Kafka) system and has no visibility of the externals.

Hope that helps. I'm still getting my head around some of this stuff myself!

Garry

-----Original Message-----
From: Anh Thu Vu [mailto:[email protected]]
Sent: 27 February 2014 14:00
To: [email protected]
Subject: Re: Read from a file and write to a Kafka stream with Samza

Hi Garry,

First of all, thanks for the link. After skimming through the archive and reading your reply, I think I misunderstood what the streams in the Wikipedia system are. In hello-samza, the WikipediaConsumer produces some wikipedia.* streams which are read by the Wikipedia feed task. The Wikipedia feed task then writes messages from those wikipedia.* streams to kafka.wikipedia-raw. So my first impression of the flow of the system was:

(Data file) => (wikipedia.* streams) => (kafka.wikipedia-raw)

where the wikipedia.* streams are something "physical", something like the Kafka streams.
But now, after thinking more about it, I guess the system works as follows: the WikipediaConsumer listens to the IRC feed and queues the messages it receives in an in-memory queue (or something in-memory). Samza periodically polls the WikipediaConsumer for messages for each of the wikipedia.* streams and sends those messages to the Wikipedia feed task. The Wikipedia feed task then writes the messages to a Kafka stream. In a way, those wikipedia.* streams are just names that Samza uses to request messages from the WikipediaConsumer. How the messages are buffered/queued/stored before they are polled by Samza is up to the WikipediaConsumer.

If my understanding is correct, then according to Chris' reply in that thread I have two options:

1) Keep it as it is now and run the Wikipedia feed task as a Samza job.
2) Implement a Kafka producer that reads from the external source and writes to a Kafka stream/topic. This would be purely a Kafka implementation and would have nothing to do with Samza.

Correct me if I'm wrong. And thank you so much for the reply.

Casey

On Thu, Feb 27, 2014 at 1:12 PM, Garry Turkington <[email protected]> wrote:

> Hi Casey,
>
> Here's a link to a response from Chris on that thread which I think
> gives the high-level picture pretty well:
>
> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201310.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179B2159F%40ESV4-MBX02.linkedin.biz%3E
>
> Basically it's a choice between building a system client within Samza
> that pulls the external data, or having an external process that pushes
> the data into a Kafka topic that is then the input to a Samza task.
> Which is most appropriate will depend on the nature of the data source,
> plus other considerations such as whether the raw data will have other
> consumers.
>
> Though I'm not quite sure I follow what you mean re your implemented
> custom system?
> If you look at the source for the Wikipedia feed task, it reads from
> the external source (Wikipedia) and then writes the output messages
> directly to Kafka (the call to collector.send) -- isn't this what you
> want?
>
> Regards,
> Garry
>
> -----Original Message-----
> From: Anh Thu Vu [mailto:[email protected]]
> Sent: 27 February 2014 11:32
> To: [email protected]
> Subject: Read from a file and write to a Kafka stream with Samza
>
> Hi everyone,
>
> As the subject says, I want to read from an external source (the
> simplest case is to read from a file) and write to a Kafka stream.
> Then another StreamTask will start reading from that Kafka stream.
>
> I've succeeded in running HelloSamza and writing a similar app (VERY
> similar) that has a custom Consumer reading from a file and then
> writing to a custom system (i.e. testsystem.mystream). Then I have a
> StreamTask that reads from this custom stream and writes to a Kafka
> stream. However, I want to bypass the custom stream and write the
> messages from the external source directly to the Kafka stream.
>
> I guess I will have to implement the SystemFactory such that it
> returns a Kafka producer for the getProducer() method, but I'm not
> very sure how to yet. Although I basically welcome and very much
> appreciate all guides/advice/suggestions, the main objective of this
> mail is to ask for a link to the thread "Writing a simple
> KafkaProducer in Samza" that was mentioned in
>
> https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201311.mbox/%3CEA1B8C30F3B4C34EBA71F2AAF48F5990D612E028%40Mail3.impetus.co.in%3E
>
> Thank you very much,
> Casey
>
> -----
> No virus found in this message.
> Checked by AVG - www.avg.com
> Version: 2014.0.4259 / Virus Database: 3705/7127 - Release Date:
> 02/26/14
> -----
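[Archive note] For anyone reading this thread later, here is a rough sketch of the buffering pattern Casey describes above: the consumer queues messages in memory as the feed delivers them, and the framework periodically drains them via a non-blocking poll. This deliberately uses simplified stand-in types, not the real org.apache.samza.system interfaces -- names like FeedEnvelope and InMemoryFeedConsumer are made up for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified stand-in for an incoming message envelope: a payload
// tagged with the name of the stream it belongs to.
class FeedEnvelope {
    final String stream;
    final String message;

    FeedEnvelope(String stream, String message) {
        this.stream = stream;
        this.message = message;
    }
}

// Sketch of the pattern discussed in the thread: a feed listener
// thread calls enqueue() as events arrive, and the framework's run
// loop calls poll() to drain whatever has been buffered since the
// last call. How messages are buffered is entirely the consumer's
// concern, as Casey notes.
class InMemoryFeedConsumer {
    private final BlockingQueue<FeedEnvelope> buffer = new LinkedBlockingQueue<>();

    // Called by the feed listener (e.g. an IRC client callback).
    void enqueue(String stream, String message) {
        buffer.add(new FeedEnvelope(stream, message));
    }

    // Called by the framework's run loop; drains and returns all
    // currently buffered envelopes without blocking.
    List<FeedEnvelope> poll() {
        List<FeedEnvelope> batch = new ArrayList<>();
        buffer.drainTo(batch);
        return batch;
    }
}

public class FeedConsumerDemo {
    public static void main(String[] args) {
        InMemoryFeedConsumer consumer = new InMemoryFeedConsumer();
        consumer.enqueue("wikipedia.edits", "page X edited");
        consumer.enqueue("wikipedia.edits", "page Y edited");

        // First poll drains both buffered messages.
        List<FeedEnvelope> batch = consumer.poll();
        System.out.println("polled " + batch.size() + " messages");

        // A second poll with nothing new buffered returns an empty batch.
        System.out.println("second poll: " + consumer.poll().size() + " messages");
    }
}
```

In the real API, a SystemFactory wired up in the task config would construct this consumer, and the task's output would go through the matching producer; the sketch only shows the enqueue/poll hand-off that the thread is discussing.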
