
Micah Kornfield Fri, 05 Aug 2016 20:18:04 -0700

Hi Everyone,
I'm an Arrow contributor mostly on the C++ side of things, but I'll try to
give a brief update of where I believe the project currently is (the views
are my own, but hopefully are fairly accurate :).

I think in the long run the diagram mentioned by Jim is where we would like
Arrow to be, but it is clearly not there yet.  The ticket referenced [2]
was an actionable first step for using Arrow, not the end state (after I
finish a couple more items in Arrow I am hoping to work on that ticket,
but that might be a little ways out still).

In terms of specification [1], the memory model specification seems to be
fairly stable but might undergo a couple more tweaks (there is still
some discussion on making a first-class string/binary column type and on
how we address endianness on different systems).  The RPC model has a first
draft for how to serialize types to a stream/memory space, but needs to be
fleshed out some more to deal with the practicalities of resource
management.

The Java implementation of Arrow was originally taken from the Apache Drill
code base and I think it is fairly close to conforming to the specification
[1] (if not already doing so).  Either way, it should be "mature" in the
sense that it is being used in a real system.

The C++/Python code is still in development and I hope to have at least a
prototype showing some form of communication between a JVM and a C++/Python
process over the next couple of weeks.

The last time I looked at the Spark code base, the main difference I saw
with Arrow versus the existing columnar memory structure in Spark was how
null flags and boolean values are stored.  Arrow bit-packs null
flags/boolean values. Spark seems to have one byte per value.  I did not
take a close look to see if the memory allocation APIs were compatible
between Spark and Arrow.
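
To make the layout difference concrete, here is a minimal Python sketch of
the two ways to store null flags.  It is an illustration only, not code from
either project: the helper names (pack_validity, is_set, byte_per_value) are
made up for this example, and it assumes the bit-packed form uses
least-significant-bit-first ordering with a set bit meaning "not null".

```python
def pack_validity(is_valid):
    """Bit-packed form: one bit per value, least-significant bit first.

    Assumption for this sketch: a set bit means the value is NOT null.
    """
    bitmap = bytearray((len(is_valid) + 7) // 8)  # round up to whole bytes
    for i, valid in enumerate(is_valid):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_set(bitmap, i):
    """Read flag i back out of a bit-packed bitmap."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def byte_per_value(is_valid):
    """Unpacked form: one whole byte per flag."""
    return bytes(1 if valid else 0 for valid in is_valid)


# Hypothetical column of ten values, three of them null.
values = [10, None, 30, 40, None, 60, 70, 80, 90, None]
validity = [v is not None for v in values]

packed = pack_validity(validity)     # 2 bytes for 10 flags
unpacked = byte_per_value(validity)  # 10 bytes for 10 flags

print(len(packed), len(unpacked))
```

For ten values the bit-packed flags take 2 bytes and the byte-per-value flags
take 10, so a buffer in one layout has to be converted before a reader that
expects the other layout can use it.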

If people are interested, I for one would like to hear feedback on the
current specifications/code for Arrow.  So please feel free to chime in on
the Arrow-dev mailing list.

Thanks,
Micah

[1] https://github.com/apache/arrow/tree/master/format
[2] https://issues.apache.org/jira/browse/SPARK-13534

On Fri, Aug 5, 2016 at 4:07 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> Don't know much about Spark + Arrow efforts myself; just wanted to share
> the reference.
>
> On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski <jpivar...@gmail.com> wrote:
>
>> On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534
>>>
>>
>> Thank you. This ticket describes output from Spark to Arrow for flat
>> (non-nested) tables. Are there no plans to input from Arrow to Spark for
>> general types? Did I misunderstand the blogs?
>>
>> I don't see it in this search: https://issues.apache.
>> org/jira/browse/SPARK-13534?jql=project%20%3D%20SPARK%
>> 20AND%20text%20~%20%22arrow%22
>>
>> I'm beginning to think there's some misleading information out there
>> (like this diagram: https://arrow.apache.org/img/shared2.png).
>>
>> -- JIm
>>
>>

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?
