We needed to process a stream of data too, and the best we could come up with was incremental data imports and incremental processing. So that's probably your best bet as of now.
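
A minimal sketch of that driver pattern, in case it helps. Each run picks up only the log files that arrived since the previous run, submits one small Map/Reduce job over them, and records a checkpoint on success. All paths and names here are illustrative, and the identity map/reduce stands in for the real analysis:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalDriver {
  private static final Path CHECKPOINT = new Path("/logs/.last_run");

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long lastRun = readCheckpoint(fs);
    long now = System.currentTimeMillis();

    Job job = new Job(conf, "incremental-log-batch");
    job.setJarByClass(IncrementalDriver.class);
    // setMapperClass/setReducerClass would hold the real analysis;
    // with nothing set, Hadoop runs identity map and reduce.

    // Pick up only files that arrived since the previous run. This
    // assumes the collector rotates logs, so a closed file no longer
    // changes after its last modification time.
    for (FileStatus st : fs.listStatus(new Path("/logs/incoming"))) {
      long mtime = st.getModificationTime();
      if (mtime > lastRun && mtime <= now) {
        FileInputFormat.addInputPath(job, st.getPath());
      }
    }
    FileOutputFormat.setOutputPath(job, new Path("/logs/out/" + now));

    if (job.waitForCompletion(true)) {
      writeCheckpoint(fs, now);  // advance the checkpoint only on success
    }
  }

  private static long readCheckpoint(FileSystem fs) throws IOException {
    if (!fs.exists(CHECKPOINT)) return 0L;
    FSDataInputStream in = fs.open(CHECKPOINT);
    try { return in.readLong(); } finally { in.close(); }
  }

  private static void writeCheckpoint(FileSystem fs, long t) throws IOException {
    FSDataOutputStream out = fs.create(CHECKPOINT, true);
    try { out.writeLong(t); } finally { out.close(); }
  }
}
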
-Amandeep

On Sat, Oct 10, 2009 at 8:05 AM, Ricky Ho <[email protected]> wrote:

> Pig provides a higher-level programming interface but doesn't change the
> fundamental batch-oriented semantics to stream-based semantics. As long as
> Pig is compiled into Map/Reduce jobs, it uses the same batch-oriented
> mechanism.
>
> I am not talking about "record boundaries". I am talking about the
> boundary between two consecutive map/reduce cycles within a continuous
> data stream.
>
> I think Ted's suggestion of the incremental small-batch approach may be a
> good solution, although I am not sure how small the batch should be. I
> assume there is some overhead in running a Hadoop job, so the batch
> shouldn't be too small. There is a tradeoff to be made between the delay
> of the result and the batch size, and I guess in most cases this should
> be OK.
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Jeff Zhang [mailto:[email protected]]
> Sent: Saturday, October 10, 2009 1:51 AM
> To: [email protected]
> Subject: Re: Parallel data stream processing
>
> I suggest you use Pig to handle your problem. Pig is a sub-project of
> Hadoop.
>
> And you do not need to worry about the boundary problem; Hadoop actually
> handles that for you.
>
> InputFormat helps you split the data, and RecordReader guarantees the
> record boundaries.
>
> Jeff Zhang
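
To illustrate Jeff's point about record boundaries: TextInputFormat's LineRecordReader skips a partial first line at the start of a split and reads past the split's end to finish its last line, so each call to map() receives one complete record even when a line straddles a split boundary. A minimal mapper along those lines; the tab-separated log format here is made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchLogMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Hypothetical log format: "timestamp<TAB>userId<TAB>searchTerm".
    // The framework guarantees 'line' is a whole line, never a fragment
    // cut off at an HDFS block boundary.
    String[] fields = line.toString().split("\t");
    if (fields.length == 3) {
      ctx.write(new Text(fields[2]), ONE);  // count one search per term
    }
  }
}
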
> On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho <[email protected]> wrote:
>
> > I'd like to get some Hadoop experts to verify my understanding ...
> >
> > To my understanding, within a Map/Reduce cycle, the input data set is
> > "frozen" (no change is allowed) while the output data set is "created
> > from scratch" (it doesn't exist before). Therefore, the map/reduce
> > model is inherently "batch-oriented". Am I right?
> >
> > I am wondering whether Hadoop is usable for processing many data
> > streams in parallel. For example, think about an e-commerce site which
> > captures users' product searches in many log files and wants to run
> > some analytics on the log files in real time.
> >
> > One naïve way is to chunkify the log and perform Map/Reduce in small
> > batches. Since the input data file must be frozen, we need to switch
> > subsequent writes to a new log file. However, the chunking approach is
> > not good because the cutoff point is quite arbitrary. Imagine I want
> > to calculate the popularity of a product based on the frequency of
> > searches within the last 2 hours (a sliding time window). I don't
> > think Hadoop can do this computation.
> >
> > Of course, if we don't mind a distorted picture, we can use a jumping
> > window (1-3 PM, 3-5 PM, ...) instead of a sliding window, and then it
> > may be OK. But this is still not good, because we have to wait for two
> > hours before getting the new batch of results. (E.g., at 4:59 PM, we
> > only have the result of the 1-3 PM batch.)
> >
> > It doesn't seem like Hadoop is good at handling this kind of
> > processing: "parallel processing of multiple real-time data streams".
> > Does anyone disagree? The term "Hadoop streaming" is confusing because
> > it means something completely different to me (i.e., using stdout and
> > stdin for input and output data).
> >
> > I'm wondering if a "mapper-only" model would work better. In this
> > case, there is no reducer (i.e., no grouping). Each map task keeps a
> > history (i.e., a sliding window) of the data it has seen and then
> > writes the result to the output file.
> >
> > I heard about the "append" mode of HDFS, but don't quite get it. Does
> > it simply mean a writer can write to the end of an existing HDFS file?
> > Or does it mean a reader can read while a writer is appending to the
> > same HDFS file? Is this "append-mode" feature helpful in my situation?
> >
> > Rgds,
> > Ricky
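
On the sliding-window question above: one common compromise is to make the jumping window much smaller than the window of interest. Each small batch job counts searches per (product, 5-minute bucket); the 2-hour sliding popularity is then just the sum of the 24 most recent buckets, which a cheap serial pass over the job outputs can recompute every 5 minutes. The bucket size bounds both the result delay and the per-job overhead Ricky mentions. A rough sketch, assuming an illustrative "epochMillis<TAB>product" log format:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BucketCounts {
  private static final long BUCKET_MS = 5 * 60 * 1000L;  // 5-minute buckets

  public static class BucketMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      if (f.length != 2) return;              // skip malformed lines
      // Assign each search event to its fixed 5-minute bucket.
      long bucket = Long.parseLong(f[0]) / BUCKET_MS;
      ctx.write(new Text(f[1] + "\t" + bucket), ONE);
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) sum += v.get();
      ctx.write(key, new LongWritable(sum));  // (product, bucket) -> count
    }
  }
}
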
