Another pointer to look at:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3369

This function, Dataset.toArrowPayload, turns a Spark Dataset into an
RDD[ArrowPayload], where an ArrowPayload essentially wraps serialized
bytes in the Arrow file format.
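
If you need to consume those bytes outside of Spark, they can be read
back with the Arrow Java library. A minimal sketch, assuming each
payload holds a single record batch in the Arrow file format
(readPayload is just a hypothetical helper name):

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.ArrowFileReader
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel

// Read one payload's bytes back into a VectorSchemaRoot.
def readPayload(payloadBytes: Array[Byte]): Unit = {
  val allocator = new RootAllocator(Long.MaxValue)
  val reader = new ArrowFileReader(
    new ByteArrayReadableSeekableByteChannel(payloadBytes), allocator)
  try {
    val root: VectorSchemaRoot = reader.getVectorSchemaRoot
    while (reader.loadNextBatch()) {
      println(s"read batch: ${root.getRowCount} rows")
    }
  } finally {
    reader.close()
    allocator.close()
  }
}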

But as Wes mentioned, this is a private function in Spark and will
probably take a bit of effort to use. I also believe Bryan has a PR to
change ArrowPayload to hold deserialized record batches instead.
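
Since these are private[sql], one (admittedly fragile) way to reach
them from user code is a small shim compiled into Spark's own package.
An untested sketch against the Spark 2.3 internals (ArrowShim and
toArrowBytes are hypothetical names; asPythonSerializable is the
byte-array accessor on ArrowPayload in that version):

// Placed in Spark's package so it can see private[sql] members.
// These are internal APIs and may change in future Spark releases.
package org.apache.spark.sql

object ArrowShim {
  // Collect every partition's payloads as raw Arrow file-format bytes.
  def toArrowBytes(df: DataFrame): Array[Array[Byte]] =
    df.toArrowPayload.collect().map(_.asPythonSerializable)
}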

On Wed, Jul 25, 2018 at 7:34 AM, Richard Siebeling <rsiebel...@gmail.com>
wrote:

> Hi,
>
> @Li, same as Jieun, I'd like to start with a single machine but can
> imagine that there are use cases for a distributed approach.
> @Wes, thanks, I'll look into it,
>
> Richard
>
> On Wed, 25 Jul 2018 at 03:59, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Richard,
> >
> > I might start here in the Spark codebase to see how Spark SQL tables
> > are converted to Arrow record batches:
> >
> >
> > https://github.com/apache/spark/blob/d8aaa771e249b3f54b57ce24763e53fd65a0dbf7/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> >
> > The code has been developed to send payloads over a socket to PySpark,
> > but it could perhaps be adapted for your needs without too much
> > effort. Li, Bryan, and others have worked on this, so they should be
> > able to answer your questions about it.
> >
> > - Wes
> >
> > On Tue, Jul 24, 2018 at 8:21 AM, Li Jin <ice.xell...@gmail.com> wrote:
> > > Hi,
> > >
> > > Do you want to collect a Spark DataFrame into Arrow format on a single
> > > machine or do you still want to keep the data distributed?
> >
>
