Paul,

Your example is exactly the same as one I discussed with some people on the RAPIDS.ai project: using Drill as a tool to gather (query) all the data to get a representative data set for an ML/AI workload, then feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow, which they used to create a GPU DataFrame. The whole point of that project was to reduce the total number of memcopy operations, resulting in an end-to-end speed-up.
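To make the memcopy point concrete: on the producing side, the handoff can be as small as streaming each batch of results in Arrow's IPC format, which the GPU-side tooling can then consume without any per-row conversion. A rough Java sketch using Arrow's stream writer (the hand-built vector below is a stand-in for a real Drill result batch):

import java.io.FileOutputStream;
import java.nio.channels.Channels;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class ArrowHandoff {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("user_id", allocator)) {
      // Stand-in for one batch of Drill query results.
      ids.allocateNew(3);
      ids.set(0, 101); ids.set(1, 102); ids.set(2, 103);
      ids.setValueCount(3);
      try (VectorSchemaRoot root = VectorSchemaRoot.of(ids);
           FileOutputStream out = new FileOutputStream("batch.arrow");
           ArrowStreamWriter writer =
               new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
        writer.start();      // writes the schema once
        writer.writeBatch(); // one IPC message per record batch, no row-by-row copy
        writer.end();
      }
    }
  }
}
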
That model of letting Drill plug into other tools would be a GREAT use case for Drill.

Jim

On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <[email protected]> wrote:

> Hi Aman,
>
> Thanks for sharing the update. Glad to hear things are still percolating.
>
> I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine the way Drill can.) Adding Arrow support for input and output would build on this advantage.
>
> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code. It would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, filters, aggregators and, in the end, their own query engine.
>
> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that, and any Arrow tool in any language could talk to Drill via the bridge.
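> To make the shape of the bridge concrete, here is a rough Java sketch of an endpoint built on that RPC layer (the Flight API, if I have the name right), sitting in front of the plain Drill JDBC driver. The Flight and JDBC-adapter class names are from memory, so treat them as approximate; it also materializes each result in one go, where a real bridge would stream batch by batch:
>
> import java.nio.charset.StandardCharsets;
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> import org.apache.arrow.adapter.jdbc.JdbcToArrow;
> import org.apache.arrow.flight.FlightServer;
> import org.apache.arrow.flight.Location;
> import org.apache.arrow.flight.NoOpFlightProducer;
> import org.apache.arrow.flight.Ticket;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VectorSchemaRoot;
>
> public class DrillFlightBridge extends NoOpFlightProducer {
>   private final RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>
>   @Override
>   public void getStream(CallContext ctx, Ticket ticket, ServerStreamListener listener) {
>     // The Flight ticket carries the SQL text; Drill's ordinary JDBC driver runs it.
>     String sql = new String(ticket.getBytes(), StandardCharsets.UTF_8);
>     try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
>          ResultSet rs = conn.createStatement().executeQuery(sql);
>          // Buffers the whole result for brevity; a real bridge would stream batches.
>          VectorSchemaRoot root = JdbcToArrow.sqlToArrow(rs, allocator)) {
>       listener.start(root);   // ship the schema
>       listener.putNext();     // ship the data
>       listener.completed();
>     } catch (Exception e) {
>       listener.error(e);
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     DrillFlightBridge bridge = new DrillFlightBridge();
>     FlightServer server = FlightServer.builder(bridge.allocator,
>         Location.forGrpcInsecure("localhost", 47470), bridge).build();
>     server.start();
>     server.awaitTermination();
>   }
> }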
> Thanks,
> - Paul
>
> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>
> Hi Charles,
> You may have seen the talk that was given at the Drill Developer Day [1] by Karthik and me; look for the slides on 'Drill-Arrow Integration', which describe two high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (these were also discussed in the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
>
> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>
> Aman
>
> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>
> > Hi Charles,
> > I didn't see anything on this on the public mailing list. Haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
> >
> > I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena. (I think most of that might be done by Dremio.) Most activity is in other languages. The code itself has drifted far away from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" work I did for Drill.
> >
> > This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
> >
> > Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. Harder now than it once was, since I think Arrow defines all vectors to be nullable, and it uses a different scheme than Drill for representing nulls.
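> > To give a flavor of that bit twirling: Drill tracks nulls in a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value. So the null info can't simply be relabeled in place; it has to be repacked, roughly like this (a sketch; the raw byte arrays stand in for the real direct buffers):
> >
> > final class NullRepack {
> >   // Drill: one byte per value, 1 = set, 0 = null.
> >   // Arrow: packed validity bitmap, LSB first, 1 = valid.
> >   static byte[] drillBitsToArrowValidity(byte[] drillBits, int valueCount) {
> >     byte[] validity = new byte[(valueCount + 7) / 8];
> >     for (int i = 0; i < valueCount; i++) {
> >       if (drillBits[i] != 0) {            // non-null in Drill...
> >         validity[i >> 3] |= 1 << (i & 7); // ...becomes a set bit in Arrow
> >       }
> >     }
> >     return validity;
> >   }
> > }
> >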
> > Thanks,
> > - Paul
> >
> > On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
> >
> > Hey Paul,
> > I'm curious as to what, if anything, ever came of this thread? IMHO, you're on to something here. We could get the benefit of Arrow—specifically the interoperability with other big data tools—without the pain of having to completely rework Drill. This seems like a real win-win to me.
> > — C
> >
> > > On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
> > >
> > > Hi Ted,
> > >
> > > We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or to convert Drill output to Arrow.
> > >
> > > The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
> > >
> > > When changing Drill internals, code must change. There is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
> > >
> > > Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways, with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
> > >
> > > It would be good to get second opinions. The PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
> > >
> > > When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
> > >
> > > Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals? That's really the question the group will want to answer.
> > >
> > > More details below.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
> > >
> > > Inline.
> > >
> > > On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
> > >
> > >> ...
> > >> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
> > >>
> > >> * Reworking Drill's direct memory model to work with Arrow's.
> > >
> > > Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
> > >
> > > Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same. (Probably did, since such integration is key to avoiding copies on send/receive.) That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
> > >
> > >> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
> > >
> > > Ted: Why? You already said that most code doesn't have to change since the format is the same.
> > >
> > > Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
> > >
> > > Paul: But, in the Drill runtime engine, we don't work with the memory directly; we use the vector APIs, mutator APIs and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
> > >
> > >> * Change all Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
> > >
> > > Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple. Why not just continue with that?
> > >
> > > Paul: In an API, we can convert one data structure to the other by writing code to copy data. But, if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
> > >
> > >> * Since generated code works directly with vectors, change all the code generation.
> > >
> > > Ted: Why? You said the UDFs would just work.
> > >
> > > Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as they do today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
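> > > Paul: To give a feel for that UDF change: both projects define holder classes with the same isSet/value fields, so each individual edit is shallow, but every function body is touched. A simplified sketch of an eval body (a real Drill UDF implements DrillSimpleFunc with @Param/@Output annotations; this shows only the holder usage):
> > >
> > > // Today: org.apache.drill.exec.expr.holders.NullableIntHolder
> > > // After: org.apache.arrow.vector.holders.NullableIntHolder
> > > public void eval(NullableIntHolder in, NullableIntHolder out) {
> > >   if (in.isSet == 0) {
> > >     out.isSet = 0;            // propagate the null
> > >   } else {
> > >     out.isSet = 1;
> > >     out.value = in.value + 1; // the actual function logic is untouched
> > >   }
> > > }
> > >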
> > > Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled memory flaw that they inherited from Drill. So, if we replace the mutators and writers, we might as well use the "result set loader" model, which a) hides the details, and b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
> > >
> > >> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
> > >
> > > Ted: How much, given the high level of compatibility?
> > >
> > > Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
> > >
> > >> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
> > >
> > > Ted: Doesn't this have to happen fairly often anyway?
> > >
> > > Ted: Perhaps this would be a good excuse for a 2.0 step.
> > >
> > > Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
> > >
> > > Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).

--
Jim Scott
Mobile/Text | +1 (989) 450-0212
MapR <https://mapr.com/>
