Instead of calling println(...), you can invoke kernel.stream.sendAll(...) to send your data without breaking it up into chunks. Give that a shot and see if you get all of your data via onStream.
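A minimal sketch of that change on the kernel side, assuming the spark-kernel cell API where an implicit `kernel` value exposes stream methods via `kernel.stream` (the `rdd` aggregation here is a hypothetical stand-in for however `agg` is built; verify the method names against your spark-kernel version):

```scala
// Step-1: build the (potentially very large) result string.
// `rdd` is a placeholder for whatever produces the data.
val agg: String = rdd.collect().mkString("\n")

// Instead of println(agg), which the kernel may split into multiple
// stream messages, send the whole value as a single stream message:
kernel.stream.sendAll(agg)
```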
On Thu, Mar 17, 2016 at 10:33 AM Harmeet Singh <[email protected]> wrote:

> Hi All,
>
> I am using spark-kernel to write an interactive application. The general
> overview of the application is as follows:
>
> Step-1: Execute code on the spark kernel to process the data. This data is
> converted into a string that can be very large. Let's say the resultant
> data is stored as a var 'agg' on spark.
>
> Step-2: Then I execute println(agg) on the kernel.
>
> Step-3: Using the onStream callback provided by the spark-kernel client, I
> get the data on the client in text format. Let's call the data on the
> client agg_client.
>
> Step-4: After I get agg_client, I need to process it to get the final
> output of the application.
>
> In executing this workflow, I am facing one major problem. The issue is
> that the data on the spark-kernel server is not the same as the data that
> I receive on the client after streaming (i.e. agg != agg_client). After
> further investigation, I observed that for a large string, the callback
> method (listening to onStream) is called multiple times and receives the
> data in smaller chunks.
>
> To fix this, I wait for some time until all the chunks are received on
> the client. However, the problem is that the chunks are not received in
> order. That is, onStream divides the data into smaller parts, and those
> parts arrive out of order at the client. I can verify that agg !=
> agg_client by comparing checksums.
>
> So, my question is:
> Is there any way to make sure that onStream sends large data in order?
>
> Or can anybody suggest an alternate approach to achieve the same?
>
>
> Regards,
> Harmeet
>
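For reference, the client side of Steps 3-4 could be sketched as below. This assumes the spark-kernel client API where execute(...) exposes an onStream callback whose content carries a `text` field; the checksum helper is an illustration of the agg vs. agg_client comparison, not part of the library (verify the names against your client version):

```scala
import java.security.MessageDigest

// Accumulate every stream message into one buffer; synchronize because the
// callback may fire from the client's message-handling thread.
val buffer = new StringBuilder

client.execute("kernel.stream.sendAll(agg)")
  .onStream(content => buffer.synchronized { buffer.append(content.text) })

// Once execution has completed, compare checksums with the server side.
def md5(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

val aggClientChecksum = md5(buffer.synchronized { buffer.toString })
```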
