Re: "Crude-but-effective" Arrow integration

Charles Givre Mon, 20 Aug 2018 07:18:41 -0700

Hi Paul,
This is a very interesting approach.  I really like the concept, in that it 
sounds like we could prove the value of the Arrow integration without “major 
surgery” to Drill.  If it proves to be valuable, we could proceed with deeper 
integration; if we determine that it is not necessary, we could avoid major 
work on Drill.


In reading about the ideas for Arrow integration, I was concerned that it would 
complicate existing UDFs and/or format plugins.  How much of this do you 
envision would be included with Drill?

—C

> On Aug 18, 2018, at 19:44, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> Charles recently explained why Arrow integration could be helpful. (See quote 
> below.)  When we looked at reworking Drill's internals to use Arrow, we 
> found the project to be costly, with little direct benefit in terms of 
> performance or stability. But Charles points out that the real value is in 
> data exchange, not in changing Drill's internals.
> 
> It might be fairly simple to integrate with Arrow for input or output. Why? 
> As it turns out (last time I checked), the memory layout of Arrow vectors is 
> identical to Drill's, so it is simply a matter of reinterpreting Drill's 
> vectors as Arrow vectors (or vice versa), possibly passing memory ownership 
> somehow. (I suspect the memory ownership issue will be the fussiest part of 
> the whole exercise.)
> 
> 
> Drill and Arrow use different metadata formats. But, since they both describe 
> the same in-memory layout, we can probably translate from one to the other 
> with some straightforward code. Since metadata is a small part of a typical 
> result set, the overhead of the metadata translation is likely negligible.
> 
> 
> If an Arrow client wants to consume Drill output, someone could wrap the 
> native Drill client API, which speaks Drill value vectors. The wrapper 
> could reinterpret Drill vectors as Arrow vectors and convert the metadata.
> 
> 
> If we want Drill to consume Arrow data, then we'd have to play the same trick 
> in reverse: reinterpret Arrow vectors as Drill vectors, then convert Arrow 
> metadata to Drill format.
> 
> The community could build this integration. Granted, this approach is a bit 
> on the "crude-but-effective" side. But if the integration proves valuable, 
> then there is justification for a next round of deeper integration.
> 
> 
> Charles' original comment from the discussion about project state:
> 
> (quote)
> The first [suggested improvement] is the Arrow integration.  I’m not enough 
> of a software engineer to understand all the internal details here, but as 
> I understand it, the promise of Arrow is that many tools will share a common 
> memory model and that it will be possible to transfer data from one tool to 
> the other without having to serialize/deserialize the data.  In the data 
> science community, many of the major platforms (Python-pandas, R, and Spark) 
> are moving to or have adopted Arrow.
> 
> Drill’s strength is the ease with which it can query many different data 
> sources, and if Drill were to adopt Arrow, I suspect that many people would 
> adopt it as part of a machine learning pipeline.  Just recently, I attempted 
> to do some data manipulation using Spark, and couldn’t help but notice how 
> difficult it was in contrast with Drill. I’m sure this is a very complex 
> task, but I do think that it could be worth it in the end.
> 
> (unquote)
> 
> Thanks,
> - Paul
>
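
The wrapper idea in the quoted message (reinterpret the vector memory, then
translate the metadata) might look roughly like the sketch below. It is only
an illustration under stated assumptions: DrillField, ArrowField,
DrillIntVector, ArrowIntView, TYPE_MAP and as_arrow are hypothetical stand-ins
written for this example, not the real Drill or Arrow classes, and the type
names are illustrative strings. It handles a single fixed-width, little-endian
32-bit integer column and ignores memory ownership entirely.

```python
import struct
from dataclasses import dataclass

# Hypothetical, simplified stand-ins. None of these are real Drill or Arrow
# classes; they only illustrate "share the buffer, translate the metadata".


@dataclass
class DrillField:
    name: str
    minor_type: str  # illustrative: "INT", "BIGINT", "FLOAT8", "VARCHAR"
    mode: str        # illustrative: "REQUIRED" or "OPTIONAL"


@dataclass
class ArrowField:
    name: str
    type_name: str   # illustrative: "int32", "int64", "float64", "utf8"
    nullable: bool


# Assumed mapping between the two sets of illustrative type names.
TYPE_MAP = {"INT": "int32", "BIGINT": "int64",
            "FLOAT8": "float64", "VARCHAR": "utf8"}


def to_arrow_field(field: DrillField) -> ArrowField:
    """Translate the (small) metadata record; no column data is touched."""
    if field.mode not in ("REQUIRED", "OPTIONAL"):
        raise ValueError(f"mode not handled in this sketch: {field.mode}")
    return ArrowField(field.name, TYPE_MAP[field.minor_type],
                      nullable=(field.mode == "OPTIONAL"))


class DrillIntVector:
    """Owns a buffer of little-endian 32-bit integers."""

    def __init__(self, values):
        self.count = len(values)
        self.buffer = bytearray(4 * self.count)
        for i, value in enumerate(values):
            struct.pack_into("<i", self.buffer, 4 * i, value)


class ArrowIntView:
    """Reads the same bytes through a memoryview; nothing is copied."""

    def __init__(self, field: ArrowField, data: memoryview, length: int):
        self.field = field
        self.data = data
        self.length = length

    def get(self, index: int) -> int:
        return struct.unpack_from("<i", self.data, 4 * index)[0]


def as_arrow(field: DrillField, vector: DrillIntVector) -> ArrowIntView:
    """Wrap a Drill-style vector as an Arrow-style one without copying data."""
    return ArrowIntView(to_arrow_field(field),
                        memoryview(vector.buffer), vector.count)
```

Because the view shares the underlying buffer, a value written through the
Drill-side vector is visible through the Arrow-side view. The sketch sidesteps
the question of who releases that memory, which is the part expected to be
fussiest in practice.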
