Async RDD saves

Antonin Delpeuch (lists) Fri, 07 Aug 2020 06:58:23 -0700

Hi all,

Following up on my question on the user mailing list [1]: there does not
seem to be any simple way to save RDDs to the file system asynchronously.
I am looking into implementing this, and would first like to check
whether there is consensus around the idea.


The goal would be to add methods such as `saveAsTextFileAsync` and
`saveAsObjectFileAsync` to the RDD API.

I am thinking about doing this by:

- refactoring `SparkHadoopWriter` so that it can submit jobs
asynchronously (with `submitJob` rather than `runJob`)

- adding a `saveAsHadoopFileAsync` method to `PairRDDFunctions`, as the
counterpart to the existing `saveAsHadoopFile`

- adding `saveAsTextFileAsync` (and equivalents for other formats) to
`AsyncRDDActions`.
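
To illustrate the first point, here is a rough sketch of the difference
between `runJob` and `submitJob` on `SparkContext`, using a toy job. It is
not the proposed refactoring. The object name `SubmitJobSketch`, the local
master and the per-partition counting function are assumptions made for
the example, and it assumes Spark core is on the classpath.

```scala
import org.apache.spark.{FutureAction, SparkConf, SparkContext}

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Hypothetical example object, not part of Spark.
object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("submit-job-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, numSlices = 4)

      // Toy per-partition function: count the elements of the partition.
      val countPartition = (it: Iterator[Int]) => it.size

      // Blocking: runJob only returns once every partition is processed.
      val blockingSizes: Array[Int] = sc.runJob(rdd, countPartition)

      // Non-blocking: submitJob returns a FutureAction right away.
      // The result handler is called as each partition finishes, and the
      // last argument is evaluated once the whole job has completed.
      val asyncSizes = new Array[Int](rdd.partitions.length)
      val future: FutureAction[Array[Int]] = sc.submitJob(
        rdd,
        countPartition,
        rdd.partitions.indices,
        (index: Int, size: Int) => asyncSizes(index) = size,
        asyncSizes)

      // The caller decides when (or whether) to wait for the job.
      val result = Await.result(future, Duration.Inf)
      println(s"same sizes: ${blockingSizes.sameElements(result)}")
    } finally {
      sc.stop()
    }
  }
}
```

A `saveAsTextFileAsync` built along these lines would presumably return the
same kind of `FutureAction` to the caller, which can then be waited on or
cancelled.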

Because `SparkHadoopWriter` is private, it is complicated for a user to
reimplement this functionality outside of Spark, so I think this would be
an API worth offering. Hopefully it can be implemented without too much
code duplication.

Cheers,

Antonin

[1]:
http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html


