Hi Everyone, I'm an Arrow contributor mostly on the C++ side of things, but I'll try to give a brief update of where I believe the project currently is (the views are my own, but hopefully are fairly accurate :).
I think in the long run the diagram mentioned by Jim, is were we would like Arrow to be, but it is clearly not there yet there. The ticket referenced [2] was an actionable first step for using Arrow, not the end state (after I finished a couple of more items in Arrow I was hoping to try to work on that ticket, but that might be a little ways out still). In terms of specification [1], the memory model specification seems to be fairly stable but might undergo a couple of more tweaks (there is still some discussion on making a first class string/binary column type and how we address endianness on different systems). The RPC model has a first draft for how to serialize types to a stream/memory space, but needs to be fleshed out some more to deal the practicalities of resource management. The Java implementation of Arrow was originally taken from the Apache Drill code base and I think it is fairly close to conforming to the specification [1] (if not already doing so). But it is should be "mature" in the sense that is being used in a real system. The C++/Python code is still in development and I hope to have at least a prototype showing some form of communication between a JVM and C++/Python process over the next couple weeks. The last time I looked at the spark code base, the main difference I saw with Arrow versus the existing columnar memory structure in Spark was how null flags and boolean values are stored. Arrow bit-packs null flags/boolean values. Spark seems to have one byte per value. I did not take a close look to see if the memory allocation APIs were compatible between Spark and Arrow. If people are interested, I for one would like to here feedback on the current specifications/code for Arrow. So please feel free to chime in the Arrow-dev mailing list. Thanks, Micah [1] https://github.com/apache/arrow/tree/master/format [2] https://issues.apache.org/jira/browse/SPARK-13534 On Fri, Aug 5, 2016 at 4:07 PM, Nicholas Chammas <nicholas.cham...@gmail.com > wrote: > Don't know much about Spark + Arrow efforts myself; just wanted to share > the reference. > > On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski <jpivar...@gmail.com> wrote: > >> On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >>> Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534 >>> >> >> Thank you. This ticket describes output from Spark to Arrow for flat >> (non-nested) tables. Are there no plans to input from Arrow to Spark for >> general types? Did I misunderstand the blogs? >> >> I don't see it in this search: https://issues.apache. >> org/jira/browse/SPARK-13534?jql=project%20%3D%20SPARK% >> 20AND%20text%20~%20%22arrow%22 >> >> I'm beginning to think there's some misleading information out there >> (like this diagram: https://arrow.apache.org/img/shared2.png). >> >> -- JIm >> >>