Instead of calling println(...), you can invoke kernel.stream.sendAll(...) to send your data without breaking it up into chunks. Give that a shot and see if you get all of your data via onStream.
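A minimal sketch of that change on the kernel side, assuming the spark-kernel cell API where an implicit `kernel` value exposes stream methods via `kernel.stream` (the `rdd` aggregation here is a hypothetical stand-in for however `agg` is built; verify the method names against your spark-kernel version):

```scala
// Step-1: build the (potentially very large) result string.
// `rdd` is a placeholder for whatever produces the data.
val agg: String = rdd.collect().mkString("\n")

// Instead of println(agg), which the kernel may split into multiple
// stream messages, send the whole value as a single stream message:
kernel.stream.sendAll(agg)
```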
On Thu, Mar 17, 2016 at 10:33 AM Harmeet Singh <[email protected]> wrote:

> Hi All,
>
> I am using spark-kernel to write an interactive application. The general
> overview of the application is as follows:
>
> Step-1: Execute code on the spark kernel to process the data. This data is
> converted into a string that can be very large. Let's say the resultant
> data is stored as a var 'agg' on spark.
>
> Step-2: Then I execute println(agg) on the kernel.
>
> Step-3: Using the onStream callback provided by the spark-kernel client, I
> get the data on the client in text format. Let's call the data on the
> client agg_client.
>
> Step-4: After I get agg_client, I need to process it to get the final
> output of the application.
>
> In executing this workflow, I am facing one major problem. The issue is
> that the data on the spark-kernel server is not the same as the data that
> I receive on the client after streaming (i.e. agg != agg_client). After
> further investigation, I observed that for a large string, the callback
> method (listening to onStream) is called multiple times and receives the
> data in smaller chunks.
>
> To fix this, I wait for some time until all the chunks are received on
> the client. However, the problem is that the chunks are not received in
> order. That is, onStream divides the data into smaller parts, and those
> parts arrive out of order at the client. I can verify that agg !=
> agg_client by comparing checksums.
>
> So, my question is:
> Is there any way to make sure that onStream sends large data in order?
>
> Or can anybody suggest an alternate approach to achieve the same?
>
>
> Regards,
> Harmeet
>
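For reference, the client side of Steps 3-4 could be sketched as below. This assumes the spark-kernel client API where execute(...) exposes an onStream callback whose content carries a `text` field; the checksum helper is an illustration of the agg vs. agg_client comparison, not part of the library (verify the names against your client version):

```scala
import java.security.MessageDigest

// Accumulate every stream message into one buffer; synchronize because the
// callback may fire from the client's message-handling thread.
val buffer = new StringBuilder

client.execute("kernel.stream.sendAll(agg)")
  .onStream(content => buffer.synchronized { buffer.append(content.text) })

// Once execution has completed, compare checksums with the server side.
def md5(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

val aggClientChecksum = md5(buffer.synchronized { buffer.toString })
```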
