This makes it sound as though memory allocation is the important difference. If so, converting Drill might be easier than previously thought.
On Sat, Aug 18, 2018, 16:44 Paul Rogers <[email protected]> wrote:

> Hi All,
>
> Charles recently suggested why Arrow integration could be helpful. (See
> quote below.) When we've looked at reworking Drill's internals to use
> Arrow, we found the project to be costly with little direct benefit in
> terms of performance or stability. But Charles points out that the real
> value is in data exchange, not in changing Drill's internals.
>
> It might be fairly simple to integrate with Arrow for input or output.
> Why? As it turns out (last time I checked), the memory layout of Arrow
> vectors is identical to Drill's, so it is simply a matter of reinterpreting
> Drill's vectors as Arrow vectors (or vice versa), possibly passing memory
> ownership somehow. (I suspect the memory ownership issue will be the
> fussiest part of the whole exercise.)
>
> Drill and Arrow use different metadata formats. But, since they both
> describe the same in-memory layout, we can probably translate from one to
> the other with some straightforward code. Since metadata is a small part of
> a typical result set, the overhead of the metadata translation is likely
> negligible.
>
> If an Arrow client wants to consume Drill output, someone could wrap the
> native Drill Client API that speaks Drill value vectors. The wrapper
> could reinterpret Drill vectors as Arrow vectors, and convert metadata.
>
> If we want Drill to consume Arrow data, then we'd have to play the same
> trick in reverse: reinterpret Arrow vectors as Drill vectors, then convert
> Arrow metadata to Drill format.
>
> The community could build such integration. Granted, this approach is a
> bit on the "crude-but-effective" side. But, if the integration proves
> valuable, then there is justification for a next round of deeper
> integration.
>
> Charles' original comment from the discussion about project state:
>
> (quote)
> The first [suggested improvement] is the Arrow integration. I’m not enough
> of a software engineer to understand all the internal details here, but as
> I understand it, the promise of Arrow is that many tools will share a
> common memory model and that it will be possible to transfer data from one
> tool to the other without having to serialize/deserialize the data. In the
> data science community, many of the major platforms (Python-pandas, R, and
> Spark) are moving to or have adopted Arrow.
>
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people would
> adopt it as a part of a machine learning pipeline. Just recently, I
> attempted to do some data manipulation using Spark, and couldn’t help but
> notice how difficult it was in contrast with Drill. I’m sure this is a very
> complex task, but I do think that it could be worth it in the end.
> (unquote)
>
> Thanks,
> - Paul
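
To make the "reinterpret the vectors, translate the metadata" idea above concrete, here is a rough sketch. Everything in it is an assumption for illustration: `DrillColumn`, `ArrowColumn`, `TYPE_MAP`, `to_arrow`, and `to_drill` are hypothetical stand-ins, not real Drill or Arrow classes, and the type mapping covers only a few fixed-width types. It takes Paul's premise (identical memory layout) as given and does not address memory ownership, which he expects to be the fussy part.

```python
from dataclasses import dataclass


# Hypothetical stand-ins; these are NOT the real Drill or Arrow classes.
@dataclass(frozen=True)
class DrillColumn:
    name: str
    minor_type: str   # e.g. "INT", "BIGINT", "FLOAT8"
    mode: str         # "REQUIRED" or "OPTIONAL"
    data: memoryview  # raw value buffer


@dataclass(frozen=True)
class ArrowColumn:
    name: str
    arrow_type: str   # e.g. "int32", "int64", "float64"
    nullable: bool
    data: memoryview  # the very same buffer, not a copy


# Assumed mapping for a few fixed-width types, for illustration only.
TYPE_MAP = {"INT": "int32", "BIGINT": "int64", "FLOAT8": "float64"}
REVERSE_TYPE_MAP = {arrow: drill for drill, arrow in TYPE_MAP.items()}


def to_arrow(col: DrillColumn) -> ArrowColumn:
    """Translate the metadata; hand over the buffer without copying it."""
    if col.minor_type not in TYPE_MAP:
        raise ValueError(f"no mapping for Drill type {col.minor_type}")
    return ArrowColumn(
        name=col.name,
        arrow_type=TYPE_MAP[col.minor_type],
        nullable=(col.mode == "OPTIONAL"),
        data=col.data,
    )


def to_drill(col: ArrowColumn) -> DrillColumn:
    """The same trick in reverse."""
    if col.arrow_type not in REVERSE_TYPE_MAP:
        raise ValueError(f"no mapping for Arrow type {col.arrow_type}")
    return DrillColumn(
        name=col.name,
        minor_type=REVERSE_TYPE_MAP[col.arrow_type],
        mode="OPTIONAL" if col.nullable else "REQUIRED",
        data=col.data,
    )


# Usage: three 4-byte values in one buffer, shared by both descriptions.
buffer = memoryview(bytearray(12))
drill_col = DrillColumn("id", "INT", "OPTIONAL", buffer)
arrow_col = to_arrow(drill_col)
```

Only the small metadata record is rebuilt; the value buffer is passed through untouched, which is why the translation overhead should be negligible if the layouts really do match.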
