We needed to process a stream of data too, and the best we could come up with was incremental data imports and incremental processing. So that's probably your best bet as of now.
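
A minimal sketch of that driver pattern, in case it helps. Each run picks up only the log files that arrived since the previous run, submits one small Map/Reduce job over them, and records a checkpoint on success. All paths and names here are illustrative, and the identity map/reduce stands in for the real analysis:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalDriver {
  private static final Path CHECKPOINT = new Path("/logs/.last_run");

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long lastRun = readCheckpoint(fs);
    long now = System.currentTimeMillis();

    Job job = new Job(conf, "incremental-log-batch");
    job.setJarByClass(IncrementalDriver.class);
    // setMapperClass/setReducerClass would hold the real analysis;
    // with nothing set, Hadoop runs identity map and reduce.

    // Pick up only files that arrived since the previous run. This
    // assumes the collector rotates logs, so a closed file no longer
    // changes after its last modification time.
    for (FileStatus st : fs.listStatus(new Path("/logs/incoming"))) {
      long mtime = st.getModificationTime();
      if (mtime > lastRun && mtime <= now) {
        FileInputFormat.addInputPath(job, st.getPath());
      }
    }
    FileOutputFormat.setOutputPath(job, new Path("/logs/out/" + now));

    if (job.waitForCompletion(true)) {
      writeCheckpoint(fs, now);  // advance the checkpoint only on success
    }
  }

  private static long readCheckpoint(FileSystem fs) throws IOException {
    if (!fs.exists(CHECKPOINT)) return 0L;
    FSDataInputStream in = fs.open(CHECKPOINT);
    try { return in.readLong(); } finally { in.close(); }
  }

  private static void writeCheckpoint(FileSystem fs, long t) throws IOException {
    FSDataOutputStream out = fs.create(CHECKPOINT, true);
    try { out.writeLong(t); } finally { out.close(); }
  }
}
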
-Amandeep

On Sat, Oct 10, 2009 at 8:05 AM, Ricky Ho <[email protected]> wrote:

> Pig provides a higher-level programming interface but doesn't change the
> fundamental batch-oriented semantics to stream-based semantics. As long as
> Pig is compiled into Map/Reduce jobs, it uses the same batch-oriented
> mechanism.
>
> I am not talking about "record boundaries". I am talking about the
> boundary between two consecutive map/reduce cycles within a continuous
> data stream.
>
> I think Ted's suggestion of the incremental small-batch approach may be a
> good solution, although I am not sure how small the batch should be. I
> assume there is some overhead in running a Hadoop job, so the batch
> shouldn't be too small. There is a tradeoff to be made between the delay
> of the result and the batch size, and I guess in most cases this should
> be OK.
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Jeff Zhang [mailto:[email protected]]
> Sent: Saturday, October 10, 2009 1:51 AM
> To: [email protected]
> Subject: Re: Parallel data stream processing
>
> I suggest you use Pig to handle your problem. Pig is a sub-project of
> Hadoop.
>
> And you do not need to worry about the boundary problem; Hadoop actually
> handles that for you.
>
> InputFormat helps you split the data, and RecordReader guarantees the
> record boundaries.
>
> Jeff Zhang
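
To illustrate Jeff's point about record boundaries: TextInputFormat's LineRecordReader skips a partial first line at the start of a split and reads past the split's end to finish its last line, so each call to map() receives one complete record even when a line straddles a split boundary. A minimal mapper along those lines; the tab-separated log format here is made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchLogMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Hypothetical log format: "timestamp<TAB>userId<TAB>searchTerm".
    // The framework guarantees 'line' is a whole line, never a fragment
    // cut off at an HDFS block boundary.
    String[] fields = line.toString().split("\t");
    if (fields.length == 3) {
      ctx.write(new Text(fields[2]), ONE);  // count one search per term
    }
  }
}
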
> On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho <[email protected]> wrote:
>
> > I'd like to get some Hadoop experts to verify my understanding ...
> >
> > To my understanding, within a Map/Reduce cycle, the input data set is
> > "frozen" (no change is allowed) while the output data set is "created
> > from scratch" (it doesn't exist before). Therefore, the map/reduce
> > model is inherently "batch-oriented". Am I right?
> >
> > I am wondering whether Hadoop is usable for processing many data
> > streams in parallel. For example, think about an e-commerce site which
> > captures users' product searches in many log files and wants to run
> > some analytics on the log files in real time.
> >
> > One naïve way is to chunkify the log and perform Map/Reduce in small
> > batches. Since the input data file must be frozen, we need to switch
> > subsequent writes to a new log file. However, the chunking approach is
> > not good because the cutoff point is quite arbitrary. Imagine I want
> > to calculate the popularity of a product based on the frequency of
> > searches within the last 2 hours (a sliding time window). I don't
> > think Hadoop can do this computation.
> >
> > Of course, if we don't mind a distorted picture, we can use a jumping
> > window (1-3 PM, 3-5 PM, ...) instead of a sliding window, and then it
> > may be OK. But this is still not good, because we have to wait for two
> > hours before getting the new batch of results. (E.g., at 4:59 PM, we
> > only have the result of the 1-3 PM batch.)
> >
> > It doesn't seem like Hadoop is good at handling this kind of
> > processing: "parallel processing of multiple real-time data streams".
> > Does anyone disagree? The term "Hadoop streaming" is confusing because
> > it means something completely different to me (i.e., using stdout and
> > stdin for input and output data).
> >
> > I'm wondering if a "mapper-only" model would work better. In this
> > case, there is no reducer (i.e., no grouping). Each map task keeps a
> > history (i.e., a sliding window) of the data it has seen and then
> > writes the result to the output file.
> >
> > I heard about the "append" mode of HDFS, but don't quite get it. Does
> > it simply mean a writer can write to the end of an existing HDFS file?
> > Or does it mean a reader can read while a writer is appending to the
> > same HDFS file? Is this "append-mode" feature helpful in my situation?
> >
> > Rgds,
> > Ricky
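
On the sliding-window question above: one common compromise is to make the jumping window much smaller than the window of interest. Each small batch job counts searches per (product, 5-minute bucket); the 2-hour sliding popularity is then just the sum of the 24 most recent buckets, which a cheap serial pass over the job outputs can recompute every 5 minutes. The bucket size bounds both the result delay and the per-job overhead Ricky mentions. A rough sketch, assuming an illustrative "epochMillis<TAB>product" log format:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BucketCounts {
  private static final long BUCKET_MS = 5 * 60 * 1000L;  // 5-minute buckets

  public static class BucketMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      if (f.length != 2) return;              // skip malformed lines
      // Assign each search event to its fixed 5-minute bucket.
      long bucket = Long.parseLong(f[0]) / BUCKET_MS;
      ctx.write(new Text(f[1] + "\t" + bucket), ONE);
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) sum += v.get();
      ctx.write(key, new LongWritable(sum));  // (product, bucket) -> count
    }
  }
}
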
