Re: Parallel data stream processing

Hong Tang Sat, 10 Oct 2009 01:07:54 -0700

MapReduce is indeed inherently a batch processing model, where each job's outcome is fully determined by the input and the operators (map, reduce, combiner), as long as the input stays immutable and the operators are deterministic and side-effect free. Such a model allows the framework to recover from failures without having to understand the semantics of the operators (unlike SQL). This is important because failures are bound to happen (frequently) in a large cluster assembled from commodity hardware.

A typical technique to bridge a batch system and a real-time system is to pair the batch system with an incremental processing component that computes deltas on top of some aggregated result. The incremental processing part would also serve real-time queries, so its data are typically stored in memory. Sometimes you have to choose an approximation algorithm for the incremental part, and periodically reset its internal state with the more precise batch processing results (e.g. for top-k queries).
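
To make the pairing concrete, here is a minimal Python sketch. It is only an illustration, not code from Hadoop or any particular system: the class name IncrementalTopK, its methods, and the idea of passing a "cutoff" timestamp with each batch result are all assumptions made for the example. It also keeps exact counts in the delta for simplicity; a real incremental part might use an approximate structure instead.

```python
from collections import Counter


class IncrementalTopK:
    """Hypothetical in-memory component: batch baseline plus streamed delta."""

    def __init__(self):
        self.base = Counter()          # counts from the last batch job
        self.cutoff = float("-inf")    # newest event time the batch covered
        self.pending = []              # (timestamp, key) events after the cutoff
        self.delta = Counter()         # counts of the pending events

    def record(self, timestamp, key):
        """Apply one streamed event; skip events the batch already covers."""
        if timestamp <= self.cutoff:
            return
        self.pending.append((timestamp, key))
        self.delta[key] += 1

    def reset(self, batch_counts, cutoff):
        """Swap in fresh batch output that covers all events up to cutoff."""
        self.base = Counter(batch_counts)
        self.cutoff = cutoff
        # Keep only the streamed events the batch result does not include.
        self.pending = [(ts, key) for ts, key in self.pending if ts > cutoff]
        self.delta = Counter(key for _, key in self.pending)

    def top_k(self, k):
        """Answer a real-time query from baseline + delta."""
        return (self.base + self.delta).most_common(k)


tracker = IncrementalTopK()
tracker.record(1, "shoes")
tracker.record(2, "shoes")
tracker.record(3, "hats")
# A batch job finishes and reports precise counts for events up to time 2.
tracker.reset({"shoes": 2}, cutoff=2)
tracker.record(4, "hats")
print(tracker.top_k(2))
```

The batch job stays the source of truth: each reset discards whatever the incremental part accumulated for the covered period, so any drift or approximation error is bounded by the time between batch runs.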


Hope this helps, Hong

On Oct 9, 2009, at 11:02 PM, Ricky Ho wrote:

I'd like to get some Hadoop experts to verify my understanding ...
To my understanding, within a Map/Reduce cycle, the input data set is "frozen" (no change is allowed) while the output data set is "created from scratch" (it doesn't exist before). Therefore, the map/reduce model is inherently "batch-oriented". Am I right?
I am wondering whether Hadoop is usable for processing many data streams in parallel. For example, think about an e-commerce site which captures users' product searches in many log files and wants to run some analytics on the log files in real time.
One naïve way is to chunk the log and perform Map/Reduce in small batches. Since the input data file must be frozen, we need to switch subsequent writes to a new log file. However, the chunking approach is not good because the cutoff point is quite arbitrary. Imagine I want to calculate the popularity of a product based on the frequency of searches within the last 2 hours (a sliding time window). I don't think Hadoop can do this computation.
Of course, if we don't mind a distorted picture, we can use a jumping window (1-3 PM, 3-5 PM ...) instead of a sliding window, and then it may be OK. But this is still not good, because we have to wait two hours before getting the next batch of results (e.g. at 4:59 PM, we only have the result of the 1-3 PM batch).
It doesn't seem like Hadoop is good at handling this kind of processing: "parallel processing of multiple real-time data streams". Does anyone disagree? The term "Hadoop streaming" is confusing because it means a completely different thing to me (i.e. using stdin and stdout for input and output data).
I'm wondering if a "mapper-only" model would work better. In this case, there is no reducer (i.e. no grouping). Each map task keeps a history (i.e. a sliding window) of the data it has seen and then writes the result to the output file.
I heard about the "append" mode of HDFS, but don't quite get it. Does it simply mean a writer can write to the end of an existing HDFS file? Or does it mean a reader can read while a writer is appending to the same HDFS file? Is this "append-mode" feature helpful in my situation?
Rgds,
Ricky
