Thanks guys, this is good. So let's say I configured my Kafka topics to ingest data from various streams (here we are talking forex tick data): I could partition out and buffer to HDFS (which has a replication factor) based on currency pair? I.e. EURUSD ...
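A minimal sketch of that idea, assuming key-based partitioning. The helper below is a simplified stand-in for Kafka's default key-hash partitioner (it uses CRC32 rather than Kafka's actual murmur2), but the property it illustrates is the same: keying each tick by its currency pair means every EURUSD message deterministically lands on the same partition, so it can be buffered to HDFS in per-pair order.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a currency-pair key to a partition.

    Simplified stand-in for Kafka's default partitioner (which hashes
    the key bytes with murmur2); any stable hash demonstrates the point:
    the same key always maps to the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

pairs = ["EURUSD", "GBPUSD", "USDJPY", "EURUSD"]
assignments = [partition_for(p, 8) for p in pairs]

# Both EURUSD ticks land on the same partition, so their relative
# order is preserved within that partition:
assert assignments[0] == assignments[3]
```

The same keying decision carries through the whole pipeline: the Kafka partition a pair maps to also determines which downstream consumer (or Samza task) sees its ticks.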
The next question I have: is it entirely appropriate to continue consuming feeds (remember, these are live feeds, not pre-generated) without having an active Samza job running over the feed at that point in time? This leads me back to my AM question. I am going to be consuming data continuously; however, as a user I may want to set up and run jobs on the stream as it arrives, either in the context of all existing data or only on a subset of data (the latter may fall back to a standard MapReduce job). I also want to write my jobs in Cucumber, but that's for another list. Thoughts?

-------- Original message --------
From: Chris Riccomini <[email protected]>
Date: 24/04/2014 03:10 (GMT+10:00)
To: [email protected]
Subject: Re: Application Master

Hey Steve,

One thing I'd add is that whereas Map/Reduce partitions tasks by file split, Samza partitions tasks by input stream partition (i.e. Kafka topic partition). It's true that a given key maps to just one partition in Samza, but it's not a 1:1 relationship--multiple keys map to the same input stream partition, and thus the same task. For example, task1 might receive messages from partition0 of the input stream, which contains messages for keys 0, 2, 4, 6, 8, etc.

Cheers,
Chris

On 4/22/14 10:46 PM, "Zhijie Shen" <[email protected]> wrote:

>The AM is the master of a distributed application on YARN. It's supposed to
>negotiate with YARN for the cluster resources and monitor the status of the
>application. It's not associated with MapReduce. MapReduce V2 has its own
>AM, while Samza has one itself as well.
>
>On Tue, Apr 22, 2014 at 3:40 PM, Steve Yates <[email protected]> wrote:
>
>> Guys, is it fair to say that YARN exposes an extension mechanism called
>> the ApplicationMaster, and by default in YARN this master is a MapReduce
>> application master?
>>
>> In the case of Samza, we have implemented a streaming case of this AM
>> which takes full advantage of the parallel / fault-tolerant mechanisms
>> built into Hadoop.
>>
>> So instead, where we partition MapReduce tasks based on file splits
>> in HDFS, we split a stream into stream tasks based on some filter key?
>> Is this correct?
>>
>> -S
>
>--
>Zhijie Shen
>Hortonworks Inc.
>http://hortonworks.com/
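Chris's even/odd example above can be sketched in a few lines. The modulo partitioner here is an assumption for illustration (Kafka's default partitioner hashes the key bytes, but on small integer keys modulo produces the same even/odd split he describes): multiple keys map to one partition, and each Samza task consumes exactly one partition of the input stream.

```python
num_partitions = 2

# One Samza task per input stream partition; each task's "inbox"
# collects every key routed to its partition.
tasks = {p: [] for p in range(num_partitions)}

for key in range(10):
    # Simplified modulo partitioner: many keys -> one partition.
    tasks[key % num_partitions].append(key)

# The task reading partition 0 sees all the even keys, exactly as in
# Chris's example (keys 0, 2, 4, 6, 8 -> task1 via partition0):
assert tasks[0] == [0, 2, 4, 6, 8]
assert tasks[1] == [1, 3, 5, 7, 9]
```

So the key-to-partition mapping is many-to-one, and the partition-to-task mapping is one-to-one, which is why the relationship between keys and tasks is not 1:1.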
