linar-jether commented on pull request #29719: URL: https://github.com/apache/spark/pull/29719#issuecomment-894766319
@HyukjinKwon @holdenk My use case is efficiently creating a Spark DataFrame from a distributed dataset. Spark currently supports this either through remote storage (e.g. writing to Parquet files) or through the `rdd[Row]` path, and both are inefficient.

The suggestion to use `DataFrame[binary]` could work as well, although it incurs an extra serialization-and-copy stage, so I don't see the benefit over creating the DataFrame directly from Arrow RecordBatches.

I must say that we use this internally quite a bit (since Spark 2.x) and it greatly improves productivity. Some example use cases:

- Reading large climatological datasets using [xarray](http://xarray.pydata.org/en/stable/index.html) and treating them as a single Spark DataFrame.
- Running many optimization problems (e.g. [cvxpy](https://www.cvxpy.org/)) in parallel using RDDs and accessing their results as a single Spark DataFrame.

I believe this feature can improve interoperability with other Python libraries, similar to what can be done with [Dask](https://docs.dask.org/en/latest/delayed-collections.html)'s `dd.from_delayed(dfs)`, and allow people to leverage Spark's SQL capabilities instead of working directly with RDDs.
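To make the overhead of the `rdd[Row]` path concrete, here is a minimal sketch (the `make_batch` helper and its toy schema are hypothetical, purely for illustration): every Arrow RecordBatch produced on the executors has to be unpacked into Python `Row` objects before `spark.createDataFrame` can consume it, which is exactly the per-row serialization cost that building the DataFrame directly from RecordBatches would avoid.

```python
import pyarrow as pa
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def make_batch(i):
    # Hypothetical per-partition computation (e.g. an xarray slice or a
    # cvxpy solve) that naturally yields an Arrow RecordBatch.
    return pa.RecordBatch.from_pydict({"id": [i], "value": [i * 2.0]})

def batch_to_rows(batch):
    # The costly step: unpack columnar Arrow data into per-row Python
    # Row objects so that spark.createDataFrame can consume them.
    columns = batch.to_pydict()
    for values in zip(*columns.values()):
        yield Row(**dict(zip(columns.keys(), values)))

rdd = spark.sparkContext.parallelize(range(4), numSlices=4)
df = spark.createDataFrame(rdd.flatMap(lambda i: batch_to_rows(make_batch(i))))
df.show()
```

For comparison, the Dask pattern referenced above assembles the distributed frame directly from per-partition results, with no intermediate row conversion (again a toy `load_part` for illustration):

```python
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_part(i):
    # Hypothetical per-partition loader returning a pandas DataFrame.
    return pd.DataFrame({"id": [i], "value": [i * 2.0]})

# Dask builds the distributed DataFrame straight from the delayed
# partitions; this is the kind of direct path proposed here for
# Arrow RecordBatches.
ddf = dd.from_delayed([load_part(i) for i in range(4)])
```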
