Pig provides a higher-level programming interface, but it doesn't change the fundamental batch-oriented semantics into stream-based semantics. As long as a Pig script is compiled into Map/Reduce jobs, it uses the same batch-oriented mechanism.
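
For concreteness, that batch contract looks like this in the Java Map/Reduce API (a minimal sketch using the default identity mapper and reducer; the job name and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BatchContractDemo {
      public static void main(String[] args) throws Exception {
        // No mapper or reducer set: the defaults pass records through
        // unchanged, which is enough to show the input/output contract.
        Job job = new Job(new Configuration(), "batch-contract-demo");
        job.setJarByClass(BatchContractDemo.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The input is read as a frozen snapshot: records appended to
        // these files after the splits are computed are not seen.
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // The output directory is created from scratch; the job fails
        // if it already exists.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

This is the mechanism a Pig script is compiled down to, which is why the higher-level syntax doesn't change the batch semantics.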
I am not talking about the "record boundary". I am talking about the boundary between two consecutive map/reduce cycles within a continuous data stream.

I am thinking Ted's suggestion of an incremental small-batch approach may be a good solution, although I am not sure how small each batch should be. I assume there is a certain overhead in launching a Hadoop job, so the batch shouldn't be too small, and there is a trade-off to make between the delay of the result and the batch size. I guess in most cases this should be OK.

Rgds,
Ricky

-----Original Message-----
From: Jeff Zhang [mailto:[email protected]]
Sent: Saturday, October 10, 2009 1:51 AM
To: [email protected]
Subject: Re: Parallel data stream processing

I suggest you use Pig to handle your problem. Pig is a sub-project of
Hadoop, and you do not need to worry about the boundary problem; Hadoop
actually handles that for you. The InputFormat helps you split the data,
and the RecordReader guarantees the record boundary.

Jeff Zhang


On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho <[email protected]> wrote:

> I'd like to get some Hadoop experts to verify my understanding ...
>
> To my understanding, within a Map/Reduce cycle, the input data set is
> "frozen" (no change is allowed) while the output data set is "created
> from scratch" (it doesn't exist beforehand). Therefore, the map/reduce
> model is inherently "batch-oriented". Am I right?
>
> I am wondering whether Hadoop is usable for processing many data streams
> in parallel. For example, think of an e-commerce site that captures
> users' product searches in many log files and wants to run some analytics
> on the log files in real time.
>
> One naïve way is to chunkify the log and perform Map/Reduce in small
> batches. Since the input data file must be frozen, we need to switch
> subsequent writes to a new log file. However, the chunking approach is
> not good because the cutoff point is quite arbitrary. Imagine I want to
> calculate the popularity of a product based on the frequency of searches
> within the last 2 hours (a sliding time window). I don't think Hadoop can
> do this computation.
>
> Of course, if we don't mind a distorted picture, we can use a jumping
> window (1-3 PM, 3-5 PM, ...) instead of a sliding window; then maybe it's
> OK. But this is still not good, because we have to wait for two hours
> before getting the new batch of results. (E.g., at 4:59 PM we only have
> the result of the 1-3 PM batch.)
>
> It doesn't seem like Hadoop is good at handling this kind of processing:
> parallel processing of multiple real-time data streams. Does anyone
> disagree? The term "Hadoop streaming" is confusing because it means
> something completely different to me (i.e., using stdout and stdin as the
> input and output data channels).
>
> I'm wondering if a "mapper-only" model would work better. In this case,
> there is no reducer (i.e., no grouping). Each map task keeps a history
> (i.e., a sliding window) of the data it has seen and then writes the
> result to the output file.
>
> I heard about the "append" mode of HDFS, but don't quite get it. Does it
> simply mean a writer can write to the end of an existing HDFS file? Or
> does it mean a reader can read while a writer is appending to the same
> HDFS file? Is this "append" feature helpful in my situation?
>
> Rgds,
> Ricky
>
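
One way to get closer to the sliding window Ricky describes, while staying inside plain Map/Reduce, is to make the mapper assign each record to every window that contains it (a sketch only; the window and slide sizes and the tab-separated "timestamp, product" log format are made-up assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SlidingWindowCount {
      static final long WINDOW = 2L * 60 * 60 * 1000; // 2-hour window
      static final long SLIDE  = 15L * 60 * 1000;     // new window every 15 min

      public static class WindowMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          long ts = Long.parseLong(parts[0]); // epoch millis of the search
          String product = parts[1];

          // Emit one count per window containing this record, so a 2-hour
          // window sliding every 15 minutes sees the record 8 times.
          long lastStart = (ts / SLIDE) * SLIDE;
          for (long start = lastStart - WINDOW + SLIDE;
               start <= lastStart; start += SLIDE) {
            ctx.write(new Text(start + ":" + product), ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text windowAndProduct,
                              Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          // Popularity of one product within one 2-hour window.
          ctx.write(windowAndProduct, new IntWritable(sum));
        }
      }
    }

The result is still computed in batches (the job has to be rerun as new log chunks close), but every run refreshes the full set of overlapping 2-hour windows, so the staleness is bounded by the batch interval rather than by two hours.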

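On the final question: in the HDFS API, "append" means the first thing, i.e. a writer can reopen an existing file and continue writing at its end via FileSystem.append(). Whether a concurrent reader could see data still being appended was version-dependent at the time, and the feature had to be switched on (dfs.support.append in the 0.19/0.20 line), so it was not something to lean on for the streaming scenario. A minimal usage sketch, with a made-up log path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/search.log"); // hypothetical log file

        // Reopen the existing file for writing at its current end;
        // existing bytes are never rewritten.
        FSDataOutputStream out = fs.append(log);
        try {
          out.writeBytes("1255161600000\twidget-42\n"); // one new record
        } finally {
          out.close(); // appended data is reliably visible after close
        }
      }
    }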