Re: [DISCUSS] Restarting the Arrow Conversation

Ted Dunning Mon, 03 Jan 2022 11:03:12 -0800

As a little bit of perspective from somebody who *was* at MapR at the time,
here are my recollections.


Arrow is pretty much the value vectors from Drill with some lessons learned
and all dependencies removed so that Arrow can be consumed separately from
Drill.

The spinout of the Dremio team didn't happen because of the lack of
integration with Arrow. It was more the other way around: because a
significant chunk of the Drill team left to form Dremio, the driving force
that could have pushed for integration was no longer around. They were off
doing Dremio and were hardly working on Drill any more.
The motive for the spinout had mostly to do with the fact that Tomer and
Jacques recognized the opportunity to build a largely in-memory analytical
engine based on zero serialization techniques and also recognized that this
could never be a priority for MapR because it was outside the center of
mass there. Once the Dremio team was out, though, they had a huge need for
interoperability with systems like Spark and Cassandra, and they needed to
not impose all of Drill as a dependency if they wanted these other systems
to take on Arrow.

This history doesn't really impact the merits or methods of integrating
present-day Drill with Arrow, but it is nice to get the story the right way
around.



On Mon, Jan 3, 2022 at 8:00 AM Charles Givre <[email protected]> wrote:

> Hello all,
> A recently closed PR [1] included a discussion between z0ltrix, James
> Turton and a few others about integrating Drill with Apache Arrow and why
> it was never done.  I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio.  This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this.   Please forgive (or correct) any inaccuracies.
>
> When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data.  From
> my understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration.  The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
>
> I don't know the internal politics, but this was one of the major points
> of divergence between Dremio and Drill.
>
> With that said, there was a renewed discussion on the list [2] where Paul
> Rogers proposed what he described as a "Crude but Effective" approach to an
> Arrow integration.
>
> The full text is at the link, but here is part of Paul's email:
>
> > Charles, just brainstorming a bit, I think the easiest way to start is to
> > create a simple, stand-alone server that speaks Arrow to the client, and
> > uses the native Drill client to speak to Drill. The native Drill client
> > exposes Drill value vectors. One trick would be to convert Drill vectors
> > to the Arrow format. I think that data vectors are the same format.
> > Possibly offset vectors. I think Arrow went its own way with null-value
> > (Drill's is-set) vectors. So, some conversion might be a no-op, others
> > might need to rewrite a vector. Good thing, this is purely at the vector
> > level, so would be easy to write. The next issue is the one that Parth
> > has long pointed out: Drill and Arrow each have their own memory
> > allocators. How could we share a data vector between the two? The
> > simplest initial solution is just to copy the data from Drill to Arrow.
> > Slow, but transparent to the client.
> >
> > A crude first-approximation of the development steps:
> > 1. Create the client shell server.
> > 2. Implement the Arrow client protocol. Need some way to accept a query
> >    and return batches of results.
> > 3. Forward the query to Drill using the native Drill client.
> > 4. As a first pass, copy vectors from Drill to Arrow and return them to
> >    the client.
> > 5. Then, solve that memory allocator problem to pass data without
> >    copying.
>
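To make the "might need to rewrite a vector" case concrete, here is a
minimal Python sketch of the null-vector conversion Paul mentions. It
assumes Drill's is-set vector holds one byte per value (nonzero means the
value is present) and that the target is an Arrow-style validity bitmap with
one bit per value, least-significant bit first. The function name is
hypothetical, and plain bytes stand in for real Drill or Arrow buffers; this
is an illustration of the idea, not code from either project.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags into an LSB-first bitmap.

    Assumption: each input byte is a per-value "is set" flag, and the
    output uses one bit per value with bit 0 of byte 0 as value 0.
    """
    # Round up so a partial final byte still gets allocated.
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            # Value i lives in byte i // 8, at bit position i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Nine values: present, null, present, present, then four nulls, then present.
flags = bytes([1, 0, 1, 1, 0, 0, 0, 0, 1])
packed = is_set_to_validity_bitmap(flags)
print(list(packed))  # [13, 1]
```

Data buffers that already share a layout would need no such pass, which is
the "no-op" case in Paul's description.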
> One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill.  Of
> course, this could be something for Drill 2.0.  At a minimum, could we take
> the conversation off of the PR and put it in the email list? ;-)
>
> Let's discuss... All ideas are welcome!
>
> Best,
> -- C
>
>
> [1]: https://github.com/apache/drill/pull/2412
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
>
>
>
>
