Another pointer to look at: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3369
The Dataset.toArrowPayload function there turns a Spark Dataset into an
RDD[ArrowPayload], where ArrowPayload basically wraps serialized bytes in the
Arrow file format. But as Wes mentioned, this is a private function in Spark
and will probably take a bit of effort to use; I believe Bryan also has a PR
to change ArrowPayload to hold deserialized record batches. Two sketches of
how the private API might be reached, and how the bytes might be read back,
follow after the quoted thread below.

On Wed, Jul 25, 2018 at 7:34 AM, Richard Siebeling <rsiebel...@gmail.com> wrote:
> Hi,
>
> @Li, same as Jieun, I'd like to start with a single machine but can
> imagine that there are use cases for a distributed approach.
> @Wes, thanks, I'll look into it,
>
> Richard
>
> On Wed, 25 Jul 2018 at 03:59, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Richard,
> >
> > I might start here in the Spark codebase to see how Spark SQL tables
> > are converted to Arrow record batches:
> >
> > https://github.com/apache/spark/blob/d8aaa771e249b3f54b57ce24763e53fd65a0dbf7/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> >
> > The code has been developed to send payloads over a socket to PySpark,
> > but it could perhaps be adapted for your needs without too much effort.
> > Li, Bryan, and others have worked on this, so they should be able to
> > answer your questions about it.
> >
> > - Wes
> >
> > On Tue, Jul 24, 2018 at 8:21 AM, Li Jin <ice.xell...@gmail.com> wrote:
> > > Hi,
> > >
> > > Do you want to collect a Spark DataFrame into Arrow format on a single
> > > machine, or do you still want to keep the data distributed?
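
A minimal sketch of one way to reach that private API, assuming Spark 2.3.x,
where Dataset.toArrowPayload is private[sql] and ArrowPayload exposes
asPythonSerializable: compiling a small helper into the org.apache.spark.sql
package makes package-private members visible. The ArrowCollect name is just
illustrative, and this internal API may change between releases.

package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowPayload

object ArrowCollect {

  // Expose the package-private payload RDD; each ArrowPayload wraps the
  // serialized bytes of one Arrow record batch.
  def toArrowPayloadRdd(df: DataFrame): RDD[ArrowPayload] =
    df.toArrowPayload

  // Pull all batches to the driver as raw Arrow bytes. This collects the
  // whole result, so it only suits the single-machine case in this thread.
  def collectAsArrowBytes(df: DataFrame): Array[Array[Byte]] =
    df.toArrowPayload.collect().map(_.asPythonSerializable)
}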
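
And a companion sketch for turning those bytes back into record batches with
the Arrow Java library on the driver. It assumes the Arrow 0.8.x package
layout that Spark 2.3 depends on (ArrowFileReader and friends moved to
org.apache.arrow.vector.ipc in later Arrow releases); ArrowRead and
readBatches are hypothetical names.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.file.ArrowFileReader
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel

object ArrowRead {

  // Iterate over the record batches contained in one payload's bytes and
  // report their row counts; real code would read the column vectors out
  // of the VectorSchemaRoot instead.
  def readBatches(bytes: Array[Byte]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val reader = new ArrowFileReader(
      new ByteArrayReadableSeekableByteChannel(bytes), allocator)
    try {
      while (reader.loadNextBatch()) {
        println(s"batch with ${reader.getVectorSchemaRoot.getRowCount} rows")
      }
    } finally {
      reader.close()
      allocator.close()
    }
  }
}

Something like ArrowCollect.collectAsArrowBytes(df).foreach(ArrowRead.readBatches)
would then walk every batch on the driver; for the distributed case you would
keep working with the RDD[ArrowPayload] partitions instead of collecting.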