This looks interesting. Anyway, it would be good if you could elaborate on
what you expect from this and on the other approaches you tried before
deciding to do it this way.

Regards,
Kalyan.

On Fri, Aug 7, 2020, 11:24 PM Edward Mitchell <edee...@gmail.com> wrote:

> I will agree that the side effects of using Futures in driver code tend to
> be tricky to track down.
>
> If you forget to clear the job description and job group information, the
> LocalProperties on the SparkContext remain intact, and
> SparkContext#submitJob passes those localProperties down to the submitted jobs.
>
> This has led to us doing this hack:
>
> [image: image.png]
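>
> Roughly, the workaround amounts to something like this (a minimal sketch;
> the helper name is illustrative, not our exact code):
>
>   import scala.concurrent.{ExecutionContext, Future}
>   import org.apache.spark.SparkContext
>
>   // Set the job group/description on the thread that actually submits the
>   // job, and clear them again afterwards so later jobs submitted from that
>   // thread do not inherit stale values.
>   def runAsync[T](sc: SparkContext, group: String, desc: String)(body: => T)
>                  (implicit ec: ExecutionContext): Future[T] = Future {
>     sc.setJobGroup(group, desc)
>     try body
>     finally sc.clearJobGroup()
>   }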
>
> This can also cause problems with Spark Streaming, where the Streaming UI
> can get confused when the various streaming-related properties it sets are
> cleared or re-used.
>
> On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Why do you need to do it, and can you just use a future in your driver
>> code?
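>>
>> For example, something along these lines (a sketch only: the thread pool and
>> path are placeholders, and `rdd` stands for whatever RDD you are saving):
>>
>>   import java.util.concurrent.Executors
>>   import scala.concurrent.{ExecutionContext, Future}
>>
>>   // A dedicated pool so the blocking save does not tie up other driver work.
>>   implicit val saveEc: ExecutionContext =
>>     ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))
>>
>>   // saveAsTextFile blocks the calling thread until the job finishes, so wrap
>>   // the call in a Future; the Spark job itself still runs on the cluster.
>>   val saved: Future[Unit] = Future {
>>     rdd.saveAsTextFile("hdfs:///tmp/output")
>>   }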
>>
>> On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
>> <li...@antonin.delpeuch.eu> wrote:
>> >
>> > Hi all,
>> >
>> > Following my question on the user mailing list [1], there does not seem
>> > to be any simple way to save RDDs to the file system asynchronously. I am
>> > looking into implementing this, so I am first checking whether there is
>> > consensus around the idea.
>> >
>> > The goal would be to add methods such as `saveAsTextFileAsync` and
>> > `saveAsObjectFileAsync` to the RDD API.
>> >
>> > I am thinking about doing this by:
>> >
>> > - refactoring SparkHadoopWriter to allow submitting jobs asynchronously
>> > (with `submitJob` rather than `runJob`),
>> >
>> > - adding a `saveAsHadoopFileAsync` method to `PairRDDFunctions`, as a
>> > counterpart to the existing `saveAsHadoopFile`,
>> >
>> > - adding `saveAsTextFileAsync` (and equivalents for other formats) to
>> > `AsyncRDDActions`; a rough usage sketch follows below.
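>> >
>> > To make the intended surface concrete, here is a rough usage sketch of
>> > what I have in mind (the method does not exist yet; I would expect it to
>> > return a `FutureAction`, like the existing `countAsync`/`collectAsync`,
>> > and the path and `rdd` are placeholders):
>> >
>> >   import scala.concurrent.ExecutionContext.Implicits.global
>> >   import scala.util.{Failure, Success}
>> >   import org.apache.spark.FutureAction
>> >
>> >   // The save is submitted without blocking the calling thread.
>> >   val save: FutureAction[Unit] = rdd.saveAsTextFileAsync("hdfs:///tmp/output")
>> >
>> >   save.onComplete {
>> >     case Success(_)  => println("save finished")
>> >     case Failure(ex) => println(s"save failed: $ex")
>> >   }
>> >
>> >   // Like other async actions, the returned FutureAction could be
>> >   // cancelled with save.cancel() if needed.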
>> >
>> > Because SparkHadoopWriter is private, it is hard for users to reimplement
>> > this functionality outside of Spark, so I think this would be an API worth
>> > offering. Hopefully it can be implemented without too much code duplication.
>> >
>> > Cheers,
>> >
>> > Antonin
>> >
>> > [1]:
>> >
>> http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
>> >
