Ted Dunning
Mon, 07 Apr 2008 09:08:58 -0700
Sliding windows are good for some things, but they often involve a lot of repeated work if the amount the window slides is small compared with the window width.

Processing small batches of input into a summary form that can be combined with the summaries of other small batches can avoid this repeated work. If you have a usable summary form, this works well. If you don't, there is a batch-size threshold that decides which approach is preferable. Having many small batches will eventually cause performance degradation so severe that reprocessing the entire window will be faster.

There are hybrid solutions as well, where most of the window is grouped into one large batch and the new data is merged into it. This requires aging out old data, which can get kind of tricky.

On 4/7/08 8:02 AM, "pi song" <[EMAIL PROTECTED]> wrote:

> 2. From Casper "Logfiles from S3 is already delayed apx. 2 hours. so I
> really have no pressure.", this reminds me about stream processing again. I
> used to say stream processing is real-time but MapReduce is batch. Now I've
> just recognized that we don't have to be strictly real-time. If say we do
> process using sliding windows every 2 hours, this way we still can apply
> some stream concepts to real-world applications.
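
To illustrate the batch-summary approach described in the reply above, here is a minimal sketch in Python. It is not from the thread: the names Summary, summarize, and BatchedWindow are hypothetical, and it assumes the quantity of interest (a count, sum, and mean) has a summary form that merges by simple addition. Each batch is reduced to a summary once, the window result is rebuilt from the retained summaries rather than from raw records, and the oldest summary is aged out when the window is full.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Mergeable summary of one batch (assumed form: count and sum)."""
    count: int = 0
    total: float = 0.0

    def merge(self, other):
        # Combining two summaries never touches the raw records again.
        return Summary(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def summarize(batch):
    """Reduce one small batch of numbers to its summary (done once per batch)."""
    return Summary(len(batch), float(sum(batch)))


class BatchedWindow:
    """Sliding window measured in batches, holding one summary per batch."""

    def __init__(self, max_batches):
        # A bounded deque drops the oldest summary when a new one arrives,
        # which is the "aging out" step.
        self.summaries = deque(maxlen=max_batches)

    def add_batch(self, batch):
        self.summaries.append(summarize(batch))

    def window_summary(self):
        # Cost is proportional to the number of batches, not raw records.
        result = Summary()
        for s in self.summaries:
            result = result.merge(s)
        return result
```

With a window of three batches, adding [1, 2, 3], [4, 5], [6], and [10, 20] in that order leaves the last three batches in the window, so window_summary() covers five values totalling 45. The merge loop also shows the trade-off mentioned in the message: as batches get smaller and more numerous, merging them costs more, and at some point reprocessing the whole window directly becomes cheaper.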