Pig provides a higher-level programming interface, but it doesn't change the fundamental batch-oriented semantics into stream-based semantics. As long as a Pig script is compiled into Map/Reduce jobs, it uses the same batch-oriented mechanism.
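
For concreteness, that batch contract looks like this in the Java Map/Reduce API (a minimal sketch using the default identity mapper and reducer; the job name and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BatchContractDemo {
      public static void main(String[] args) throws Exception {
        // No mapper or reducer set: the defaults pass records through
        // unchanged, which is enough to show the input/output contract.
        Job job = new Job(new Configuration(), "batch-contract-demo");
        job.setJarByClass(BatchContractDemo.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The input is read as a frozen snapshot: records appended to
        // these files after the splits are computed are not seen.
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // The output directory is created from scratch; the job fails
        // if it already exists.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

This is the mechanism a Pig script is compiled down to, which is why the higher-level syntax doesn't change the batch semantics.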
I am not talking about the "record boundary". I am talking about the boundary between two consecutive map/reduce cycles within a continuous data stream.

I am thinking Ted's suggestion of an incremental small-batch approach may be a good solution, although I am not sure how small each batch should be. I assume there is a certain overhead in launching a Hadoop job, so the batch shouldn't be too small, and there is a trade-off to make between the delay of the result and the batch size. I guess in most cases this should be OK.

Rgds,
Ricky

-----Original Message-----
From: Jeff Zhang [mailto:[email protected]]
Sent: Saturday, October 10, 2009 1:51 AM
To: [email protected]
Subject: Re: Parallel data stream processing

I suggest you use Pig to handle your problem. Pig is a sub-project of
Hadoop, and you do not need to worry about the boundary problem; Hadoop
actually handles that for you. The InputFormat helps you split the data,
and the RecordReader guarantees the record boundary.

Jeff Zhang


On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho <[email protected]> wrote:

> I'd like to get some Hadoop experts to verify my understanding ...
>
> To my understanding, within a Map/Reduce cycle, the input data set is
> "frozen" (no change is allowed) while the output data set is "created
> from scratch" (it doesn't exist beforehand). Therefore, the map/reduce
> model is inherently "batch-oriented". Am I right?
>
> I am wondering whether Hadoop is usable for processing many data streams
> in parallel. For example, think of an e-commerce site that captures
> users' product searches in many log files and wants to run some analytics
> on the log files in real time.
>
> One naïve way is to chunkify the log and perform Map/Reduce in small
> batches. Since the input data file must be frozen, we need to switch
> subsequent writes to a new log file. However, the chunking approach is
> not good because the cutoff point is quite arbitrary. Imagine I want to
> calculate the popularity of a product based on the frequency of searches
> within the last 2 hours (a sliding time window). I don't think Hadoop can
> do this computation.
>
> Of course, if we don't mind a distorted picture, we can use a jumping
> window (1-3 PM, 3-5 PM, ...) instead of a sliding window; then maybe it's
> OK. But this is still not good, because we have to wait for two hours
> before getting the new batch of results. (E.g., at 4:59 PM we only have
> the result of the 1-3 PM batch.)
>
> It doesn't seem like Hadoop is good at handling this kind of processing:
> parallel processing of multiple real-time data streams. Does anyone
> disagree? The term "Hadoop streaming" is confusing because it means
> something completely different to me (i.e., using stdout and stdin as the
> input and output data channels).
>
> I'm wondering if a "mapper-only" model would work better. In this case,
> there is no reducer (i.e., no grouping). Each map task keeps a history
> (i.e., a sliding window) of the data it has seen and then writes the
> result to the output file.
>
> I heard about the "append" mode of HDFS, but don't quite get it. Does it
> simply mean a writer can write to the end of an existing HDFS file? Or
> does it mean a reader can read while a writer is appending to the same
> HDFS file? Is this "append" feature helpful in my situation?
>
> Rgds,
> Ricky
>
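
One way to get closer to the sliding window Ricky describes, while staying inside plain Map/Reduce, is to make the mapper assign each record to every window that contains it (a sketch only; the window and slide sizes and the tab-separated "timestamp, product" log format are made-up assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SlidingWindowCount {
      static final long WINDOW = 2L * 60 * 60 * 1000; // 2-hour window
      static final long SLIDE  = 15L * 60 * 1000;     // new window every 15 min

      public static class WindowMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          long ts = Long.parseLong(parts[0]); // epoch millis of the search
          String product = parts[1];

          // Emit one count per window containing this record, so a 2-hour
          // window sliding every 15 minutes sees the record 8 times.
          long lastStart = (ts / SLIDE) * SLIDE;
          for (long start = lastStart - WINDOW + SLIDE;
               start <= lastStart; start += SLIDE) {
            ctx.write(new Text(start + ":" + product), ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text windowAndProduct,
                              Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          // Popularity of one product within one 2-hour window.
          ctx.write(windowAndProduct, new IntWritable(sum));
        }
      }
    }

The result is still computed in batches (the job has to be rerun as new log chunks close), but every run refreshes the full set of overlapping 2-hour windows, so the staleness is bounded by the batch interval rather than by two hours.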

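On the final question: in the HDFS API, "append" means the first thing, i.e. a writer can reopen an existing file and continue writing at its end via FileSystem.append(). Whether a concurrent reader could see data still being appended was version-dependent at the time, and the feature had to be switched on (dfs.support.append in the 0.19/0.20 line), so it was not something to lean on for the streaming scenario. A minimal usage sketch, with a made-up log path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/search.log"); // hypothetical log file

        // Reopen the existing file for writing at its current end;
        // existing bytes are never rewritten.
        FSDataOutputStream out = fs.append(log);
        try {
          out.writeBytes("1255161600000\twidget-42\n"); // one new record
        } finally {
          out.close(); // appended data is reliably visible after close
        }
      }
    }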