This makes it sound like allocation is the important difference. If so,
converting Drill might be easier than previously thought.

On Sat, Aug 18, 2018, 16:44 Paul Rogers <[email protected]> wrote:

> Hi All,
>
> Charles recently suggested why Arrow integration could be helpful. (See
> quote below.)  When we've looked at reworking Drill's internals to use
> Arrow, we found the project to be costly with little direct benefit in
> terms of performance or stability. But, Charles points out that the real
> value is in data exchange, not in changing Drill's internals.
>
> It might be fairly simple to integrate with Arrow for input or output.
> Why? As it turns out (last time I checked) the memory layout of Arrow
> vectors is identical to Drill's, so it is simply a matter of reinterpreting
> Drill's vectors as Arrow vectors (or vice versa), possibly passing memory
> ownership somehow. (I suspect the memory ownership issue will be the
> fussiest part of the whole exercise.)
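
To make "reinterpreting" concrete, here is a minimal sketch in plain Python.
It uses neither the Drill nor the Arrow API; the bytearray is only a
hypothetical stand-in for the data buffer of a fixed-width, non-nullable
32-bit integer vector. The point is that two typed views can share one block
of memory, so a value written through one is visible through the other
without a copy. Memory ownership, the fussy part, is not addressed here.

```python
# Hypothetical stand-in for a vector's data buffer: room for four 32-bit ints.
raw = bytearray(16)

# "Producer" view: interpret the bytes as native 32-bit signed integers.
producer_view = memoryview(raw).cast("i")
for index, value in enumerate([10, 20, 30, 40]):
    producer_view[index] = value

# "Consumer" view: a second interpretation of the very same bytes.
# No data is copied; both views point at `raw`.
consumer_view = memoryview(raw).cast("i")

# A later write through one view is visible through the other,
# which is what sharing (rather than copying) the buffer means.
producer_view[2] = 99
shared_values = consumer_view.tolist()
```

Both views keep a reference to the underlying buffer, which hints at why
deciding who frees the memory needs care in a real integration.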
>
>
> Drill and Arrow use different metadata formats. But, since they both
> describe the same in-memory layout, we can probably translate from one to
> the other with some straightforward code. Since metadata is a small part of
> a typical result set, the overhead of the metadata translation is likely
> negligible.
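
The metadata translation could look roughly like the sketch below. It is
plain Python with no Drill or Arrow dependency. The type names on both sides
are written as strings, the mapping is illustrative and incomplete, and the
function name drill_field_to_arrow is hypothetical. Only the REQUIRED and
OPTIONAL data modes are handled.

```python
# Illustrative, incomplete mapping from Drill minor type names to
# Arrow logical type descriptions (assumption: strings stand in for real types).
DRILL_TO_ARROW_TYPE = {
    "INT": "Int(32, signed)",
    "BIGINT": "Int(64, signed)",
    "FLOAT4": "FloatingPoint(SINGLE)",
    "FLOAT8": "FloatingPoint(DOUBLE)",
    "VARCHAR": "Utf8",
    "BIT": "Bool",
}


def drill_field_to_arrow(name, minor_type, data_mode):
    """Translate one Drill column description into an Arrow-style field dict.

    Hypothetical helper: `minor_type` and `data_mode` are plain strings here.
    """
    if minor_type not in DRILL_TO_ARROW_TYPE:
        raise ValueError(f"no mapping for Drill type {minor_type!r}")
    if data_mode not in ("REQUIRED", "OPTIONAL"):
        # REPEATED (list) columns need nested handling; left out of this sketch.
        raise NotImplementedError(f"data mode {data_mode!r} not handled")
    return {
        "name": name,
        "type": DRILL_TO_ARROW_TYPE[minor_type],
        "nullable": data_mode == "OPTIONAL",
    }


# Usage: translate a small two-column schema.
schema = [
    drill_field_to_arrow("id", "INT", "REQUIRED"),
    drill_field_to_arrow("city", "VARCHAR", "OPTIONAL"),
]
```

Because the translation runs once per schema rather than once per value, its
cost stays small next to the data itself.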
>
>
> If an Arrow client wants to consume Drill output, someone could wrap the
> native Drill Client API, which speaks Drill value vectors. The wrapper
> could reinterpret Drill vectors as Arrow vectors, and convert metadata.
>
>
> If we want Drill to consume Arrow data, then we'd have to play the same
> trick in reverse: reinterpret Arrow vectors as Drill vectors, then convert
> Arrow metadata to Drill format.
>
> The community could build such an integration. Granted, this approach is
> a bit on the "crude-but-effective" side. But, if the integration proves
> valuable, then there is justification for a next round of deeper
> integration.
>
>
>  Charles' original comment from the discussion about project state:
>
> (quote)
> The first [suggested improvement] is the Arrow integration.  I’m not
> enough of a software engineer to understand all the internal details here,
> but as I understand it, the promise of Arrow is that many tools will share
> a common memory model and that it will be possible to transfer data from
> one tool to the other without having to serialize/deserialize the data.
> In the data science community many of the major platforms, Python-pandas,
> R, and Spark, are moving to or have adopted Arrow.
>
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people would
> adopt it as a part of a machine learning pipeline.  Just recently, I
> attempted to do some data manipulation using Spark, and couldn’t help but
> notice how difficult it was in contrast with Drill. I’m sure this is a very
> complex task, but I do think that it could be worth it in the end.
>
> (unquote)
>
> Thanks,
> - Paul
>
>
