Hi all,

I am using spark-kernel to write an interactive application. The general workflow is as follows:

Step 1: Execute code on the Spark kernel to process the data. The result is converted to a string, which can be very large. Let's say the result is stored in a var 'agg' on Spark.
Step 2: Execute println(agg) on the kernel.
Step 3: Using the onStream callback provided by the spark-kernel client, receive the data on the client as text. Let's call the data on the client agg_client.
Step 4: Process agg_client to produce the final output of the application.

In executing this workflow I am facing one major problem: the data on the spark-kernel server is not the same as the data I receive on the client after streaming (i.e. agg != agg_client). I can verify this by comparing checksums.

After further investigation, I observed that for a large string the callback listening to onStream is called multiple times and receives the data in smaller chunks. To handle this, I wait for some time until all the chunks have been received on the client. However, the chunks are not received in order. That is, the output is split into smaller parts, and those parts do not arrive at the client in the order they were produced.

So, my question is: is there any way to make sure that onStream delivers large data in order? Or can anybody suggest an alternative approach to achieve the same thing?

Regards,
Harmeet
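
P.S. To make the problem concrete, here is a minimal sketch of the kind of client-side accumulation and checksum comparison described above. It is only an illustration: ChunkCollector, onChunk, assemble and md5Hex are hypothetical names, not part of the spark-kernel client API, and the chunks are simulated by splitting a local string rather than received from a kernel. MD5 is used only as an example checksum.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ChunkCheck {
  // Hex-encoded MD5 of a string, computed over its UTF-8 bytes.
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Hypothetical stand-in for the body of an onStream callback:
  // it simply appends each chunk in the order it is delivered.
  final class ChunkCollector {
    private val buf = new StringBuilder
    def onChunk(text: String): Unit = buf.append(text)
    def result: String = buf.toString
  }

  // Rebuild a string by feeding chunks to a collector in the given order.
  def assemble(chunks: Seq[String]): String = {
    val collector = new ChunkCollector
    chunks.foreach(collector.onChunk)
    collector.result
  }

  def main(args: Array[String]): Unit = {
    // Stands in for the large server-side string 'agg'.
    val agg = (1 to 1000).mkString(",")
    // Simulated stream chunks (assumption: fixed 256-character pieces).
    val chunks = agg.grouped(256).toVector

    val inOrder = assemble(chunks)            // chunks delivered in order
    val outOfOrder = assemble(chunks.reverse) // chunks delivered out of order

    println(s"in order matches:     ${md5Hex(agg) == md5Hex(inOrder)}")
    println(s"out of order matches: ${md5Hex(agg) == md5Hex(outOfOrder)}")
  }
}
```

With in-order delivery the two checksums agree. Once the chunks arrive in a different order, plain appending produces a different string and the comparison fails, which is the behaviour I am seeing.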
