Thanks Julien and Bryan.

Bryan, perfect, this is super helpful. I will check your recent update to
https://github.com/BryanCutler/spark/commits/wip-toPandas_with_arrow-SPARK-13534
and rebase on top of it.

On Wed, Apr 26, 2017 at 10:23 PM, Bryan Cutler <cutl...@gmail.com> wrote:

> I just updated my PR for SPARK-13534
> https://github.com/apache/spark/pull/15821 so that it uses the latest from
> Arrow; hopefully that helps.  I also have been playing around with Python
> UDFs in Spark with Arrow.  I have something sort of working; there are
> still some issues and the branch is kind of messy right now, but feel free
> to check it out:
> https://github.com/BryanCutler/spark/tree/wip-arrow-stream-serializer - I
> just mention this because I saw you created a related Spark PR and I'd be
> glad to help out if you want.
>
> Bryan
>
> On Wed, Apr 26, 2017 at 2:21 PM, Julien Le Dem <jul...@dremio.com> wrote:
>
> > Example of writing to and reading from a file:
> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java
> > Similarly, in case you don't want to go through a file, unloading a
> > vector into buffers and loading from buffers:
> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
> > The VectorLoader/Unloader are used to read/write files.
> >
> > On Wed, Apr 26, 2017 at 10:31 AM, Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Thanks for the various pointers. I was looking at
> > > ArrowFileWriter/Reader and got a little bit confused.
> > >
> > > So what I am trying to do is to convert a list of Spark rows into some
> > > Arrow format in Java (I will probably go with the file format for now),
> > > send the bytes to Python, and deserialize them into a pyarrow table.
> > >
> > > Here is what I currently plan to do:
> > > (1) convert the rows to one or more Arrow record batches (using the
> > > ValueVectors)
> > > (2) serialize the Arrow record batches and send them over to Python
> > > (not sure what to use here, ArrowFileWriter?)
> > > (3) deserialize the bytes into a pyarrow.Table using pyarrow.FileReader
> > >
> > > I *think* ArrowFileWriter is what I should use to send data over in
> > > (2), but:
> > > (1) I would need to turn the Arrow record batches into a
> > > VectorSchemaRoot by doing something like this:
> > > https://github.com/icexelloss/spark/blob/pandas-udf/sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala#L226
> > > (2) I am not sure how to write all the data in a VectorSchemaRoot using
> > > ArrowFileWriter.
> > >
> > > Does this sound like the right thing to do?
> > >
> > > Thanks,
> > > Li
> > >
> > > On Tue, Apr 25, 2017 at 8:52 PM, Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > Also, now that we have a website that is easier to write content for
> > > > (in Markdown), it would be great if some Java developers could
> > > > volunteer some time to write user-facing documentation to go with
> > > > the Javadocs.
> > > >
> > > > On Tue, Apr 25, 2017 at 8:51 PM, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >
> > > > > There is also
> > > > > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java
> > > > >
> > > > > On Tue, Apr 25, 2017 at 8:46 PM, Li Jin <ice.xell...@gmail.com>
> > wrote:
> > > > >
> > > > >> Thanks Julien. I will follow
> > > > >> https://github.com/apache/arrow/blob/990e2bde758ac8bc6e4497ae1bc37f89b71bb5cf/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java#L91
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Julien
> >
>
