linar-jether commented on pull request #29719:
URL: https://github.com/apache/spark/pull/29719#issuecomment-894766319


   @HyukjinKwon @holdenk My use case is efficiently creating a Spark DataFrame from a distributed dataset. Spark currently supports doing this either through remote storage (e.g. writing Parquet files) or via an `RDD[Row]`, and both routes are inefficient.
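   For context, a minimal sketch of the existing row-based route (the column names are just illustrative): every record becomes a `Row` that is pickled and converted one record at a time, with no Arrow batching, which is where the overhead comes from.
   
   ```python
   from pyspark.sql import Row, SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   
   # Each element becomes a Row object that is pickled and converted
   # record by record; no Arrow batching happens on this path.
   rdd = spark.sparkContext.parallelize(range(1000)).map(
       lambda i: Row(id=i, value=float(i) * 2.0)
   )
   df = spark.createDataFrame(rdd)  # schema inferred by sampling rows
   df.show(3)
   ```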
   
   The suggestion to use `DataFrame[binary]` could work as well, although it incurs an additional serialization + copying stage, so I don't see the benefit over creating the DataFrame directly out of Arrow RecordBatches.
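   To make that extra stage concrete, here is a rough sketch of the binary workaround, assuming a hypothetical `rdd_of_batches` (an RDD of `pyarrow.RecordBatch` objects built on the executors) and an illustrative result schema: each batch is serialized to IPC bytes, shipped through a binary column, and then deserialized again in `mapInPandas`.
   
   ```python
   import io
   
   import pyarrow as pa
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   
   def batch_to_bytes(batch: pa.RecordBatch) -> bytes:
       # Serialize one RecordBatch to Arrow IPC stream bytes (copy #1).
       sink = io.BytesIO()
       with pa.ipc.new_stream(sink, batch.schema) as writer:
           writer.write_batch(batch)
       return sink.getvalue()
   
   # rdd_of_batches: hypothetical RDD[pa.RecordBatch] produced elsewhere.
   binary_df = spark.createDataFrame(
       rdd_of_batches.map(lambda b: (batch_to_bytes(b),)), "payload binary"
   )
   
   def decode(frames_iter):
       # Deserialize the IPC bytes back into pandas frames (copy #2).
       for pdf in frames_iter:
           for payload in pdf["payload"]:
               yield pa.ipc.open_stream(payload).read_all().to_pandas()
   
   # "id long, value double" is an illustrative schema, not from the PR.
   df = binary_df.mapInPandas(decode, schema="id long, value double")
   ```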
   
   I must say that we have used this internally quite a bit (since Spark 2.x) and it greatly improves productivity. Some example use cases:
   - Reading large climatological datasets with [xarray](http://xarray.pydata.org/en/stable/index.html) and treating them as a single Spark DataFrame.
   - Running many optimization problems (e.g. with [cvxpy](https://www.cvxpy.org/)) in parallel via RDDs and accessing their results as a single Spark DataFrame.
   
   I believe this feature can improve interoperability with other Python libraries, similar to what can be done with [Dask](https://docs.dask.org/en/latest/delayed-collections.html)'s `dd.from_delayed(dfs)`, and allow people to leverage Spark's SQL capabilities instead of working directly with RDDs.
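   For comparison, a self-contained Dask snippet using that API (`load_partition` is just a stand-in for any independent per-partition load):
   
   ```python
   import dask.dataframe as dd
   import pandas as pd
   from dask import delayed
   
   @delayed
   def load_partition(i):
       # Stand-in for any expensive, independent loading step.
       return pd.DataFrame({"x": range(i * 10, (i + 1) * 10)})
   
   parts = [load_partition(i) for i in range(4)]
   meta = pd.DataFrame({"x": pd.Series(dtype="int64")})  # schema hint
   ddf = dd.from_delayed(parts, meta=meta)
   print(ddf.x.sum().compute())  # 780
   ```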
   

