Thanks Julien and Bryan. Bryan, perfect, this is super helpful. I will check your recent update to https://github.com/BryanCutler/spark/commits/wip-toPandas_with_arrow-SPARK-13534 and rebase on top of it.
On Wed, Apr 26, 2017 at 10:23 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> I just updated my PR for SPARK-13534
> https://github.com/apache/spark/pull/15821 so that it uses the latest from
> Arrow; hopefully that should help. I have also been playing around with
> Python UDFs in Spark with Arrow. I have something sort of working; there
> are still some issues and the branch is kind of messy right now, but feel
> free to check it out:
> https://github.com/BryanCutler/spark/tree/wip-arrow-stream-serializer - I
> just mention this because I saw you created a related Spark PR and I'd be
> glad to help out if you want.
>
> Bryan
>
> On Wed, Apr 26, 2017 at 2:21 PM, Julien Le Dem <jul...@dremio.com> wrote:
> >
> > Example of writing to and reading from a file:
> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java
> >
> > Similarly, in case you don't want to go through a file, unloading a
> > vector into buffers and loading from buffers:
> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
> >
> > The VectorLoader/VectorUnloader are used to read/write files.
> >
> > On Wed, Apr 26, 2017 at 10:31 AM, Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > Thanks for the various pointers. I was looking at ArrowFileWriter/Reader
> > > and got a little confused.
> > >
> > > What I am trying to do is convert a list of Spark rows into some Arrow
> > > format in Java (I will probably go with the file format for now), send
> > > the bytes to Python, and deserialize them into a pyarrow Table.
> > >
> > > Here is what I currently plan to do:
> > > (1) Convert the rows to one or more Arrow record batches (using the
> > > ValueVectors).
> > > (2) Serialize the Arrow record batches and send them over to Python
> > > (not sure what to use here; ArrowFileWriter?).
> > > (3) Deserialize the bytes into a pyarrow.Table using pyarrow.FileReader.
> > >
> > > I *think* ArrowFileWriter is what I should use to send data over in (2),
> > > but:
> > > (1) I would need to turn the Arrow record batches into a VectorSchemaRoot
> > > by doing something like this:
> > > https://github.com/icexelloss/spark/blob/pandas-udf/sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala#L226
> > > (2) I am not sure how to write all the data in a VectorSchemaRoot using
> > > ArrowFileWriter.
> > >
> > > Does this sound like the right thing to do?
> > >
> > > Thanks,
> > > Li
> > >
> > > On Tue, Apr 25, 2017 at 8:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > Also, now that we have a website that is easier to write content for
> > > > (in Markdown), it would be great if some Java developers could
> > > > volunteer some time to write user-facing documentation to go with the
> > > > Javadocs.
> > > >
> > > > On Tue, Apr 25, 2017 at 8:51 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > There is also
> > > > > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java
> > > > >
> > > > > On Tue, Apr 25, 2017 at 8:46 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > > > >
> > > > >> Thanks Julien. I will follow
> > > > >> https://github.com/apache/arrow/blob/990e2bde758ac8bc6e4497ae1bc37f89b71bb5cf/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java#L91
> >
> > --
> > Julien