Re: "Crude-but-effective" Arrow integration

Paul Rogers Mon, 20 Aug 2018 09:12:45 -0700

Hi Charles,

Regarding UDFs and Arrow: if Arrow is used just as an interface format (as 
outlined in the original post), then Drill's internals continue to use Drill 
value vectors and UDFs are unchanged.

If Arrow is adopted internally in Drill, then vast amounts of runtime code must 
change (see my reply to Ted), including the UDF API.

You had mentioned writing UDFs in Python. UDFs typically run in the Filter 
operator. Perhaps you're thinking that the Filter operator could serialize the 
current record batch to Python using Arrow, the Python UDF could filter the 
rows, and the data would then be serialized back into Drill. Clearly, this will 
impact performance. But, if we were to do that, then the same kind of "shim" 
layer described in the original post could be used in the "PyDrill" extension.

We'd probably just write a new PyFilter operator that handles the details. 
Sounds like a great project for someone who wants to learn Drill internals. 
Such a project would be pretty self-contained, but would develop a deep 
expertise in Drill's runtime implementation.
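
To make the round trip concrete, here is a minimal sketch of only the Python 
side of such a filter. Everything in it is an assumption for illustration: 
the batch is a plain dict of column lists standing in for an Arrow record 
batch, and py_filter and is_large_order are hypothetical names, not part of 
Drill or Arrow.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a record batch: column name -> list of values.
# A real shim would hand Python an Arrow record batch instead.
Batch = Dict[str, List]


def py_filter(batch: Batch, predicate: Callable[[dict], bool]) -> Batch:
    """Apply a row-level Python predicate to a columnar batch."""
    names = list(batch)
    row_count = len(batch[names[0]]) if names else 0
    # Collect the indexes of the rows that pass the predicate.
    selected = [
        i for i in range(row_count)
        if predicate({name: batch[name][i] for name in names})
    ]
    # Gather surviving rows column by column so the output stays columnar.
    return {name: [batch[name][i] for i in selected] for name in names}


def is_large_order(row: dict) -> bool:
    # Hypothetical user-written UDF: keep rows with amount above 100.
    return row["amount"] > 100


batch = {"id": [1, 2, 3, 4], "amount": [50, 150, 100, 300]}
filtered = py_filter(batch, is_large_order)
print(filtered)  # {'id': [2, 4], 'amount': [150, 300]}
```

The sketch evaluates the predicate row by row, which is the simplest thing 
to write but also the slowest; the serialization on each side of this call 
is where the performance cost mentioned above would show up.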

Thanks,
- Paul

    On Monday, August 20, 2018, 7:18:21 AM PDT, Charles Givre 
<[email protected]> wrote:  

 Hi Paul, 
This is a very interesting approach.  I really like the concept in that it 
sounds like we could prove the value of the Arrow integration without “major 
surgery” to Drill.  If it proves to be valuable we could proceed with deeper 
integration, or if we determine that it is not necessary, we could avoid major 
work to Drill. 

I was concerned in reading about the ideas for Arrow integration, that it would 
complicate existing UDFs and/or Format-plugins.  How much of this do you 
envision would be included with Drill?

—C

> On Aug 18, 2018, at 19:44, Paul Rogers <[email protected]> wrote:
> 
> Hi All,
> 
> Charles recently explained why Arrow integration could be helpful. (See quote 
> below.)  When we've looked at reworking Drill's internals to use Arrow, we 
> found the project to be costly with little direct benefit in terms of 
> performance or stability. But, Charles points out that the real value is in 
> data exchange, not in changing Drill's internals.
> 
> It might be fairly simple to integrate with Arrow for input or output. Why? 
> As it turns out (last time I checked) the memory layout of Arrow vectors is 
> identical to Drill's, so it is simply a matter of reinterpreting Drill's 
> vectors as Arrow vectors (or vice versa); possibly passing memory ownership 
> somehow. (I suspect the memory ownership issue will be the fussiest part of 
> the whole exercise.)
> 
> 
> Drill and Arrow use different metadata formats. But, since they both describe 
> the same in-memory layout, we can probably translate from one to the other 
> with some straightforward code. Since metadata is a small part of a typical 
> result set, the overhead of the metadata translation is likely negligible.
> 
> 
> If an Arrow client wants to consume Drill output, someone could wrap the 
> native Drill Client API, which speaks Drill value vectors. The wrapper 
> could reinterpret Drill vectors as Arrow vectors, and convert metadata.
> 
> 
> If we want Drill to consume Arrow data, then we'd have to play the same trick 
> in reverse: reinterpret Arrow vectors as Drill vectors, then convert Arrow 
> metadata to Drill format.
> 
> Such integration could be built by the community. 
> Granted, this approach is a bit on the "crude-but-effective" side. But, if 
> the integration proves valuable, then there is justification for a next round 
> of deeper integration.
> 
> 
> Charles' original comment from the discussion about project state:
> 
> (quote)
> The first [suggested improvement] is the Arrow integration.  I’m not enough 
> of a software engineer to understand
> all the internal details here, but as I understand it, the promise of Arrow 
> is that many tools
> will share a common memory model and that it will be possible to transfer 
> data from one tool
> to the other without having to serialize/deserialize the data.  In the data 
> science community
> many of the major platforms, Python-pandas, R, and Spark are moving to or have 
> adopted Arrow.
> 
> Drill’s strength is the ease with which it can query many different data sources 
> and if Drill
> were to adopt Arrow, I suspect that many people would adopt it as a part of a 
> machine learning
> pipeline.  Just recently, I attempted to do some data manipulation using 
> Spark, and couldn’t
> help but notice how difficult it was in contrast with Drill. I’m sure this is 
> a very complex
> task, but I do think that it could be worth it in the end.
> 
> (unquote)
> 
> Thanks,
> - Paul
>
