The block interval is configurable, so I think you can reduce it and keep each block in memory only for that shorter interval. Is that what you are looking for?
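For example (a minimal sketch; the exact value format depends on the Spark version, since older releases take plain milliseconds while newer ones also accept duration strings like "50ms"):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    // spark.streaming.blockInterval controls how often received data is
    // chunked into blocks (default 200 ms); a smaller value means smaller,
    // shorter-lived blocks.
    val conf = new SparkConf()
      .setAppName("BlockIntervalExample")
      .set("spark.streaming.blockInterval", "50")

    // Placeholder 500 ms batch interval for illustration only.
    val ssc = new StreamingContext(conf, Milliseconds(500))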
On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wbi...@gmail.com> wrote:

> Hi,
>
> I'm learning Spark, and I think the current streaming implementation could
> be optimized. Correct me if I'm wrong.
>
> The current streaming implementation puts the data of one batch into memory
> (as an RDD). But that seems unnecessary.
>
> For example, if I want to count the lines that contain the word "Spark", I
> just need to map every line to see whether it contains the word, then reduce
> with a sum function. After that, there is no reason to keep the line in
> memory any longer.
>
> That is to say, if the DStream has only one map and/or reduce operation on
> it, it is not necessary to keep all the batch data in memory. Something
> like a pipeline should work.
>
> Would it be difficult to implement this on top of the current implementation?
>
> Thanks.
>
> ---
> Bin Wang

--
*Arush Kharbanda* || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
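For reference, the computation Bin describes looks roughly like this against the DStream API (a minimal sketch; the socket source, host, port, and batch interval are placeholders, not anything from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkLineCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SparkLineCount")
        // Placeholder 10-second batch interval.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder source: lines of text read from a local socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Map each line to 1 if it contains "Spark", else 0, then sum per
        // batch. Each line is touched exactly once, which is the basis of
        // Bin's point that the whole batch need not stay in memory.
        val matches = lines
          .map(line => if (line.contains("Spark")) 1L else 0L)
          .reduce(_ + _)

        matches.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }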