Re: "Crude-but-effective" Arrow integration

Ted Dunning Mon, 20 Aug 2018 08:18:02 -0700

This makes it sound like allocation is the important difference. If so,
converting Drill might be easier than was thought.


On Sat, Aug 18, 2018, 16:44 Paul Rogers <[email protected]> wrote:

> Hi All,
>
> Charles recently suggested why Arrow integration could be helpful. (See
> quote below.)  When we've looked at reworking Drill's internals to use
> Arrow, we found the project to be costly with little direct benefit in
> terms of performance or stability. But, Charles points out that the real
> value is in data exchange, not in changing Drill's internals.
>
> It might be fairly simple to integrate with Arrow for input or output.
> Why? As it turns out (last time I checked) the memory layout of Arrow
> vectors is identical to Drill's, so it is simply a matter of reinterpreting
> Drill's vectors as Arrow vectors (or vice versa); possibly passing memory
> ownership somehow. (I suspect the memory ownership issue will be the
> fussiest part of the whole exercise.)
>
>
> Drill and Arrow use different metadata formats. But, since they both
> describe the same in-memory layout, we can probably translate from one to
> the other with some straightforward code. Since metadata is a small part of
> a typical result set, the overhead of the metadata translation is likely
> negligible.
>
>
> If an Arrow client wants to consume Drill output, someone could wrap the
> native Drill Client API, which speaks Drill value vectors. The wrapper
> could reinterpret Drill vectors as Arrow vectors, and convert metadata.
>
>
> If we want Drill to consume Arrow data, then we'd have to play the same
> trick in reverse: reinterpret Arrow vectors as Drill vectors, then convert
> Arrow metadata to Drill format.
>
> The community could build such an integration. Granted, this approach is a
> bit on the "crude-but-effective"
> side. But, if the integration proves valuable, then there is justification
> for a next round of deeper integration.
>
>
> Charles' original comment from the discussion about project state:
>
> (quote)
> The first [suggested improvement] is the Arrow integration.  I’m not enough
> of a software engineer to understand all the internal details here, but as
> I understand it, the promise of Arrow is that many tools will share a
> common memory model and that it will be possible to transfer data from one
> tool to the other without having to serialize/deserialize the data.  In the
> data science community many of the major platforms, Python-pandas, R, and
> Spark are moving to or have adopted Arrow.
>
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people would
> adopt it as a part of a machine learning pipeline.  Just recently, I
> attempted to do some data manipulation using Spark, and couldn’t help but
> notice how difficult it was in contrast with Drill. I’m sure this is a very
> complex task, but I do think that it could be worth it in the end.
>
> (unquote)
>
> Thanks,
> - Paul
>
>
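
As a rough illustration of the metadata translation described in the quoted
message, here is a minimal Java sketch. It assumes the data buffers are shared
untouched and only the per-column type description is rewritten. The class
name, method names, and the string type labels are hypothetical stand-ins
chosen for this example; they are not the real Drill or Arrow metadata
classes, and a real wrapper would also have to deal with nullability, nested
types, and memory ownership.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative stand-ins, not the real
// Drill or Arrow metadata classes.
public class MetadataSketch {

    // Assumed mapping from a Drill-style type name to an Arrow-style type label.
    private static final Map<String, String> DRILL_TO_ARROW = new LinkedHashMap<>();
    static {
        DRILL_TO_ARROW.put("INT", "Int(32, signed)");
        DRILL_TO_ARROW.put("BIGINT", "Int(64, signed)");
        DRILL_TO_ARROW.put("FLOAT8", "FloatingPoint(DOUBLE)");
        DRILL_TO_ARROW.put("VARCHAR", "Utf8");
    }

    // Translate one column's type description; no vector data is read or copied.
    public static String toArrowType(String drillType) {
        String arrow = DRILL_TO_ARROW.get(drillType);
        if (arrow == null) {
            throw new IllegalArgumentException("No mapping for Drill type: " + drillType);
        }
        return arrow;
    }

    // Translate a whole schema (column name -> Drill type), preserving column order.
    public static Map<String, String> toArrowSchema(Map<String, String> drillSchema) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : drillSchema.entrySet()) {
            out.put(e.getKey(), toArrowType(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> drill = new LinkedHashMap<>();
        drill.put("id", "BIGINT");
        drill.put("name", "VARCHAR");
        System.out.println(toArrowSchema(drill));
    }
}
```

The work here scales with the number of columns, not the number of rows,
which is why the translation overhead is expected to be negligible for a
typical result set.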
