Hey Anh, You might also want to have a look at:
https://issues.apache.org/jira/browse/SAMZA-138 This is a basic implementation of a SystemConsumer that reads messages from a file, line by line, and feeds them to Samza. Cheers, Chris On 2/27/14 7:50 AM, "Garry Turkington" <[email protected]> wrote: >Hi Casey, > >Yep, your explanation is correct. > >Even though Samza is currently very much associated with Kafka the actual >model is very system neutral. If you look at the SystemConsumer interface >it has a poll method that returns a List<IncomingMessageEnvelope>. The >Samza framework calls that method on the concrete consumer implementation >which in this case is either going to be retrieving messages from a Kafka >topic or the Wikipedia feed. Then when the task writes messages out via >the Collector.write method they are passed through the similarly named >method in SystemProducer. > >In the task config file the class for the system factory is specified and >that's the mechanism to generate the actual SystemConsumer and >SystemProducer objects. Then if you needed an actual Samza system to >produce messages directly that are read by a task you'd start by >implementing the SystemFactory and SystemConsumer. If you just use an >external client to push data straight into Kafka then from Samza's >perspective it is just reading and writing from the same (Kafka) system >and has no visibility of the externals. > >Hope that helps, I'm still getting my head around some of this stuff >myself! >Garry > >-----Original Message----- >From: Anh Thu Vu [mailto:[email protected]] >Sent: 27 February 2014 14:00 >To: [email protected] >Subject: Re: Read from a file and write to a Kafka stream with Samza > >Hi Garry, > >First of all thank for the link. > >After skimming through the archive and reading your reply, I think I >misunderstood what are the streams in wikipedia system. > >So, in hello-samza, the WikipediaConsumer produces some wikipedia.* >streams which are read by Wikipedia feed task. Wikipedia feed task then >write messages from those wikipedia.* streams to kafka.wikipedia-raw. > >So my first impression of the flow of the system is: (Data file) => >(wikipedia.* streams) => (kafka.wikipedia-raw) where wikipedia.* streams >are something "physical", something like the kafka streams. > >But now after thinking more about it, I guess the system works as follow: >WikipediaConsumer listens to the IRC feed, queue the messages it receives >in a in-memory queue (or something in-memory). >The Samza periodically polls WikipediaConsumer for messages for each of >the >wikipedia.* streams, sends the messages to Wikipedia feed task. >Wikipedia feed task writes messages to kafka stream. > >In a way, those wikipedia streams are just names that the Samza uses to >request WikipediaConsumer for messages. How to messages are >buffered/queued/stored before they are polled by Samza is up to >WikipediaConsumer. > >Then if my understanding is correct, and according to Chris' reply in >that thread. I have 2 options: >1) Keep it as it is now and run the Wikipedia feed task as a Samza job >2) Implement a Kafka producer that reads from external source and write >to a Kafka stream/topic. This will purely just an implementation on Kafka >and has nothing to do with Samza. > >Correct me if I'm wrong. And thank you so much for the reply. >Casey > > >On Thu, Feb 27, 2014 at 1:12 PM, Garry Turkington < >[email protected]> wrote: > >> Hi Casey, >> >> Here's a link to a response from Chris on that thread which I think >> gives the high-level picture pretty well: >> >> >> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201310.mb >> ox/%3C1B43C7411DB20E47AB0FB62E7262B80179B2159F%40ESV4-MBX02.linkedin.b >> iz%3E >> >> Basically it's a choice between building a system client within Samza >> that pulls the external data or having an external process that pushes >> the data into a Kafka topic that is then the input to a Samza task. >> Which is most appropriate will depend on the nature of the data source >> plus other considerations such as if the raw data will have other >>consumers. >> >> Though I'm not quite sure I follow what you mean re your implemented >> custom system? If you look at the source for the Wikipedia feed task >> it reads from the external source (Wikipedia) then writes the output >> message directly to Kafka (the call to collector.send) -- isn't this >>what you want? >> >> Regards, >> Garry >> >> -----Original Message----- >> From: Anh Thu Vu [mailto:[email protected]] >> Sent: 27 February 2014 11:32 >> To: [email protected] >> Subject: Read from a file and write to a Kafka stream with Samza >> >> Hi everyone, >> >> As the subject said, I want to read from external source (simplest >> case is to read from a file) and write to a Kafka stream. Then another >> StreamTask will start reading from the Kafka stream. >> >> I've succeeded running HelloSamza, write a similar app (VERY similar) >> that have a custom Consumer reading from a file then write to a custom >> system >> (i.e.: testsystem.mystream). Then I have a StreamTask that read from >> this custom stream, and write to a kafka stream. However, I want to >> bypass the custom stream and write the message from the external >> source directly to the Kafka stream. >> >> I guess I will have to implement the SystemFactory such that it will >> return a Kafka producer for the getProducer() method but I'm not very >> sure how to yet. Although I basically welcome & appreciate very much >> all guides/advises/suggestions, my main objective of this mail is to >> ask for the link to the thread "Writing a simple KafkaProducer in >> Samza" that was mentioned in >> >> >> https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201311.m >> box/%3CEA1B8C30F3B4C34EBA71F2AAF48F5990D612E028%40Mail3.impetus.co.in% >> 3E >> >> Thank you very much, >> Casey >> >> ----- >> No virus found in this message. >> Checked by AVG - www.avg.com >> Version: 2014.0.4259 / Virus Database: 3705/7127 - Release Date: >> 02/26/14 >> > >----- >No virus found in this message. >Checked by AVG - www.avg.com >Version: 2014.0.4259 / Virus Database: 3705/7127 - Release Date: 02/26/14
