Help needed in streaming large data from spark-kernel to spark-client

Harmeet Singh Fri, 18 Mar 2016 20:43:03 -0700

Hi All,

I am using spark-kernel to write an interactive application. The general
overview of the application is as follows:


Step-1: Execute code on the spark kernel to process the data. The result is
converted into a string, which can be very large. Let's say the resulting
string is stored as a var 'agg' on spark.

Step-2: Then I execute println(agg) on the kernel.

Step-3: Using the onStream callback provided by the spark-kernel client, I get
the data on the client in text format. Let's call the data on the client
agg_client.

Step-4: After I get agg_client, I need to process it to get the final output
for the application.


In executing this workflow, I am facing one major problem: the data on the
spark-kernel server is not the same as the data that I receive on the client
after streaming (i.e. agg != agg_client). After further investigation, I
observed that for a large string, the callback method (listening to onStream)
is called multiple times and receives the data in smaller chunks.

To fix this, I wait for some time until all the chunks are received on the
client. However, the problem is that the chunks are not received in order.
That is, onStream divides the data into smaller parts, and those parts do not
arrive in order at the client. I can verify that agg != agg_client by
comparing checksums.
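
The checksum comparison can be sketched roughly as follows. This is only an
illustration, not the code from the application: the names ChecksumCheck and
md5Hex are hypothetical, and MD5 is just one possible digest. The same helper
would be run on 'agg' on the kernel and on 'agg_client' on the client, with
both sides encoding the string as UTF-8.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper, not part of spark-kernel.
object ChecksumCheck {
  // Hex-encoded MD5 digest of a string, always hashing its UTF-8 bytes
  // so that kernel and client agree on the byte representation.
  def md5Hex(s: String): String = {
    val digest = MessageDigest
      .getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
    digest.map(b => f"${b & 0xff}%02x").mkString
  }

  // True when the text rebuilt on the client matches the kernel-side digest.
  def matches(kernelDigest: String, clientText: String): Boolean =
    kernelDigest == md5Hex(clientText)
}
```

Concatenating the same chunks in a different order produces a different
digest, which is why the comparison exposes the reordering.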

So, my question is:
Is there any way to make sure that onStream sends large data in order?

Or can anybody suggest an alternative approach to achieve the same result?
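
One possible alternative, sketched here under assumptions rather than taken
from spark-kernel itself, is to stop relying on arrival order: print the data
as small numbered lines on the kernel and rebuild the string on the client by
index. The names ChunkFraming, frame and Reassembler are hypothetical. The
sketch assumes that each printed line arrives whole inside some stream message
(several lines per message are fine, a line split across messages is not),
which is not confirmed for spark-kernel. The payload is Base64-encoded so that
it never contains a newline or a ':' character.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64
import scala.collection.mutable

// Hypothetical helper, not part of spark-kernel.
object ChunkFraming {
  // Kernel side: Base64-encode the whole text, cut it into pieces of
  // chunkSize characters, and frame each piece as "index/total:payload".
  def frame(text: String, chunkSize: Int): Seq[String] = {
    val encoded = Base64.getEncoder.encodeToString(text.getBytes(UTF_8))
    val parts = encoded.grouped(chunkSize).toVector
    parts.zipWithIndex.map { case (part, i) => s"$i/${parts.size}:$part" }
  }

  // Client side: collects frames in any order and rebuilds the text.
  // Not thread-safe; malformed lines raise an exception in this sketch.
  final class Reassembler {
    private val received = mutable.Map.empty[Int, String]
    private var total = -1

    // Feed the text of one stream message; it may hold several frames
    // separated by newlines.
    def accept(message: String): Unit =
      message.split("\n").map(_.trim).filter(_.nonEmpty).foreach { line =>
        val Array(header, payload) = line.split(":", 2)
        val Array(index, count) = header.split("/")
        total = count.toInt
        received(index.toInt) = payload
      }

    // True once every index from 0 until total has been seen.
    def isComplete: Boolean = total >= 0 && received.size == total

    // The original text, available only after all frames have arrived.
    def result: Option[String] =
      if (!isComplete) None
      else {
        val encoded = (0 until total).map(received).mkString
        Some(new String(Base64.getDecoder.decode(encoded), UTF_8))
      }
  }
}
```

On the kernel this would replace println(agg) with something like
ChunkFraming.frame(agg, 4096).foreach(println), and the onStream callback
would pass each received text to accept and check isComplete, which also
removes the need to wait for a fixed amount of time. An empty 'agg' produces
no frames and would need separate handling.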


Regards,
Harmeet
