Paul,

Your example is exactly the same as one I discussed with some people on the RAPIDS.ai project: using Drill as a tool to gather (query) all the data to get a representative data set for an ML/AI workload, then feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow, which they used to create a GPU DataFrame. The whole point of that project was to reduce the total number of memcopy operations, resulting in an end-to-end speed-up.
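To make the memcopy point concrete: on the producing side, the handoff can be as small as streaming each batch of results in Arrow's IPC format, which the GPU-side tooling can then consume without any per-row conversion. A rough Java sketch using Arrow's stream writer (the hand-built vector below is a stand-in for a real Drill result batch):

import java.io.FileOutputStream;
import java.nio.channels.Channels;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class ArrowHandoff {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("user_id", allocator)) {
      // Stand-in for one batch of Drill query results.
      ids.allocateNew(3);
      ids.set(0, 101); ids.set(1, 102); ids.set(2, 103);
      ids.setValueCount(3);
      try (VectorSchemaRoot root = VectorSchemaRoot.of(ids);
           FileOutputStream out = new FileOutputStream("batch.arrow");
           ArrowStreamWriter writer =
               new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
        writer.start();      // writes the schema once
        writer.writeBatch(); // one IPC message per record batch, no row-by-row copy
        writer.end();
      }
    }
  }
}
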
That model of letting Drill plug into other tools would be a GREAT use case for Drill.

Jim

On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <[email protected]> wrote:

> Hi Aman,
>
> Thanks for sharing the update. Glad to hear things are still percolating.
>
> I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine the way Drill can.) Adding Arrow support for input and output would build on this advantage.
>
> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code. It would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, filters, aggregators and, in the end, their own query engine.
>
> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that, and any Arrow tool in any language could talk to Drill via the bridge.
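> To make the shape of the bridge concrete, here is a rough Java sketch of an endpoint built on that RPC layer (the Flight API, if I have the name right), sitting in front of the plain Drill JDBC driver. The Flight and JDBC-adapter class names are from memory, so treat them as approximate; it also materializes each result in one go, where a real bridge would stream batch by batch:
>
> import java.nio.charset.StandardCharsets;
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> import org.apache.arrow.adapter.jdbc.JdbcToArrow;
> import org.apache.arrow.flight.FlightServer;
> import org.apache.arrow.flight.Location;
> import org.apache.arrow.flight.NoOpFlightProducer;
> import org.apache.arrow.flight.Ticket;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VectorSchemaRoot;
>
> public class DrillFlightBridge extends NoOpFlightProducer {
>   private final RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>
>   @Override
>   public void getStream(CallContext ctx, Ticket ticket, ServerStreamListener listener) {
>     // The Flight ticket carries the SQL text; Drill's ordinary JDBC driver runs it.
>     String sql = new String(ticket.getBytes(), StandardCharsets.UTF_8);
>     try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
>          ResultSet rs = conn.createStatement().executeQuery(sql);
>          // Buffers the whole result for brevity; a real bridge would stream batches.
>          VectorSchemaRoot root = JdbcToArrow.sqlToArrow(rs, allocator)) {
>       listener.start(root);   // ship the schema
>       listener.putNext();     // ship the data
>       listener.completed();
>     } catch (Exception e) {
>       listener.error(e);
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     DrillFlightBridge bridge = new DrillFlightBridge();
>     FlightServer server = FlightServer.builder(bridge.allocator,
>         Location.forGrpcInsecure("localhost", 47470), bridge).build();
>     server.start();
>     server.awaitTermination();
>   }
> }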
> Thanks,
> - Paul
>
> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>
> Hi Charles,
> You may have seen the talk that was given at the Drill Developer Day [1] by Karthik and me; look for the slides on 'Drill-Arrow Integration', which describe two high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (these were also discussed in the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
>
> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>
> Aman
>
> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>
> > Hi Charles,
> > I didn't see anything on this on the public mailing list. Haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
> >
> > I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena. (I think most of that might be done by Dremio.) Most activity is in other languages. The code itself has drifted far away from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" work I did for Drill.
> >
> > This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
> >
> > Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. Harder now than it once was, since I think Arrow defines all vectors to be nullable, and it uses a different scheme than Drill for representing nulls.
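> > To give a flavor of that bit twirling: Drill tracks nulls in a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value. So the null info can't simply be relabeled in place; it has to be repacked, roughly like this (a sketch; the raw byte arrays stand in for the real direct buffers):
> >
> > final class NullRepack {
> >   // Drill: one byte per value, 1 = set, 0 = null.
> >   // Arrow: packed validity bitmap, LSB first, 1 = valid.
> >   static byte[] drillBitsToArrowValidity(byte[] drillBits, int valueCount) {
> >     byte[] validity = new byte[(valueCount + 7) / 8];
> >     for (int i = 0; i < valueCount; i++) {
> >       if (drillBits[i] != 0) {            // non-null in Drill...
> >         validity[i >> 3] |= 1 << (i & 7); // ...becomes a set bit in Arrow
> >       }
> >     }
> >     return validity;
> >   }
> > }
> >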
> > Thanks,
> > - Paul
> >
> > On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
> >
> > Hey Paul,
> > I'm curious as to what, if anything, ever came of this thread? IMHO, you're on to something here. We could get the benefit of Arrow—specifically the interoperability with other big data tools—without the pain of having to completely rework Drill. This seems like a real win-win to me.
> > — C
> >
> > > On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
> > >
> > > Hi Ted,
> > >
> > > We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or to convert Drill output to Arrow.
> > >
> > > The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
> > >
> > > When changing Drill internals, code must change. There is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
> > >
> > > Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways, with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
> > >
> > > It would be good to get second opinions. The PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
> > >
> > > When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
> > >
> > > Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals? That's really the question the group will want to answer.
> > >
> > > More details below.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
> > >
> > > Inline.
> > >
> > > On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
> > >
> > >> ...
> > >> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
> > >>
> > >> * Reworking Drill's direct memory model to work with Arrow's.
> > >
> > > Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
> > >
> > > Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same. (Probably did, since such integration is key to avoiding copies on send/receive.) That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
> > >
> > >> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
> > >
> > > Ted: Why? You already said that most code doesn't have to change since the format is the same.
> > >
> > > Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
> > >
> > > Paul: But, in the Drill runtime engine, we don't work with the memory directly; we use the vector APIs, mutator APIs and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
> > >
> > >> * Change all Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
> > >
> > > Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple. Why not just continue with that?
> > >
> > > Paul: In an API, we can convert one data structure to the other by writing code to copy data. But, if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
> > >
> > >> * Since generated code works directly with vectors, change all the code generation.
> > >
> > > Ted: Why? You said the UDFs would just work.
> > >
> > > Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as they do today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
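> > > Paul: To give a feel for that UDF change: both projects define holder classes with the same isSet/value fields, so each individual edit is shallow, but every function body is touched. A simplified sketch of an eval body (a real Drill UDF implements DrillSimpleFunc with @Param/@Output annotations; this shows only the holder usage):
> > >
> > > // Today: org.apache.drill.exec.expr.holders.NullableIntHolder
> > > // After: org.apache.arrow.vector.holders.NullableIntHolder
> > > public void eval(NullableIntHolder in, NullableIntHolder out) {
> > >   if (in.isSet == 0) {
> > >     out.isSet = 0;            // propagate the null
> > >   } else {
> > >     out.isSet = 1;
> > >     out.value = in.value + 1; // the actual function logic is untouched
> > >   }
> > > }
> > >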
> > > Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled memory flaw that they inherited from Drill. So, if we replace the mutators and writers, we might as well use the "result set loader" model, which a) hides the details, and b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
> > >
> > >> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
> > >
> > > Ted: How much, given the high level of compatibility?
> > >
> > > Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
> > >
> > >> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
> > >
> > > Ted: Doesn't this have to happen fairly often anyway?
> > >
> > > Ted: Perhaps this would be a good excuse for a 2.0 step.
> > >
> > > Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
> > >
> > > Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).

--
Jim Scott
Mobile/Text | +1 (989) 450-0212
MapR <https://mapr.com/>
