Re: About integration of drill and arrow

Igor Guzenko Fri, 10 Jan 2020 13:40:06 -0800

Hello All,

Volodymyr, from your last response, I can figure out that what you are
suggesting is not the actual integration plan, right? To me, it sounds like
we can create some POC
to evaluate whether the migration to Arrow makes sense. And if it actually
makes sense, integrate through the long way, aka create evf-api,
evf-drill-vectors, evf-arrow-vectors
and do all the operators rewriting to EVF until we can completely switch to
evf-arrow-vectors memory.


In case if I understood you wrong. You mean integration plan and you still
want to make possible, let's say combining Drill-based and Arrow-EVF-based
operators within one DAG.
I want to ask Paul whether it's possible to do so with EVF avoiding the
adaptation problems described in my previous email.

Let's look at the simple example first:
             Join
            /       \
DrillScan    ArrowEvfScan

The Join operator will only work in such a case if it's updated to work
with left and right batches as with some set of rows using generic EVF
API.  If it's not updated,
I mean based on old Vectors low-level details, it's broken without the
Arrow-to-Drill vector adapters. I don't see how EVF can help in this worst
case.

Thanks,
Igor

On Fri, Jan 10, 2020 at 7:55 PM Volodymyr Vysotskyi <volody...@apache.org>
wrote:

> Hi Paul and Igor,
>
> It is great that the discussion has affected high-level questions of the
> effort and benefits of moving to the Arrow.
> The main arguments of moving to Arrow for me were possible performance
> improvements (perhaps with Gandiva usage) and significant codebase
> improvements (perhaps with the additional bug fixes) compared to current
> Drill's vectors code.
>
> In my previous letter, I concentrated on the case when we will agree to
> completely move to the Arrow and proposed the way, which in my opinion
> would be more optimal and splittable.
>
> I understand your concerns about having mixed two systems.
> I don't propose to recommend Drill users to enable Arrow usage while a lot
> of conversions between batches would be happening.
> But without trying to use Arrow classes step-by-step, we risk to end up
> with dozens of unresolved Arrow-related issues,
> a lot of changes in unmerged branches and merge conflicts after every
> new commit into the master branch.
>
> Regarding unnecessary complexity connected with adapting Arrow and Drill
> vectors to work together, this conversion may be done at EVF API level, no
> need to dive into the vectors implementation details and issues of each
> side, we just need to extend EVF API if required, and just be sure that
> both implementations still works correctly with that API.
> Even with your strategy, we should adopt EVF to work with Arrow, I just
> propose to do it at the beginning and additionally preserve EVF with Drill
> value vectors.
>
> Also, when we will be able to switch between Arrow and Drill even for some
> operator kinds, we would be able to estimate how the operator performance
> was changed and make the decision to continue the integration having real
> numbers.
> And for the case, if we wouldn't be satisfied with the performance changes,
> we may stop integrating before rewriting all the project code. But in this
> case, we would have minimum changes, enough for just using Arrow as a data
> source and data sink for easier integration with other projects. I think
> this is a required minimum we should provide independently on the decision
> about deeper integration.
>
> Kind regards,
> Volodymyr Vysotskyi
>
>
> On Fri, Jan 10, 2020 at 1:47 PM Igor Guzenko <ihor.huzenko....@gmail.com>
> wrote:
>
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as migration to Arrow but uncovered questions of
> > strategical plans for moving towards Apache Drill 2.0.
> > Below are my personal thoughts of what we, as developers, should do to
> > offer Drill users better experience:
> >
> > 1. High performant bulk insertions into as many data sources as possible.
> > There is a whole bunch of different tools for data pipelining to use...
> > But why people who know SQL should spend time learning something new for
> > simply moving data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management,
> > improved costs planning using meta store, etc.). Since we're dealing with
> > big data alongside other tools installed on data nodes we should utilize
> > memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as
> possible.
> > The high amount of bugs in the area tells that we have lots to improve.
> > Every user is happy when he gets a tool and it simply works as expected.
> > Also, analyze user requirements and provide integration with new most
> > popular tools.  Querying high variety of
> > data sources were and still one of the biggest selling points.
> >
> > 4. Make code highly extensible and extremely friendly for contributions.
> No
> > one would want to spend years of learning to make a contribution. This is
> > why I want to see a lot of modules that are highly cohesive and define
> > clear APIs for interaction with each other. This is also about paying old
> > technical debts related to fat JDBC client, copy of web server in Drill
> on
> > YARN, mixing everything in exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query
> > planning to execution.
> >
> > These are my thoughts from developer's perspective. Since I'm just
> > developer from Ukraine and far far away from Drill users, I believe that
> > Charles Givre is the one who can build a strong Drill user community and
> > collect their requirements for us.
> >
> >
> > What relates to Volodymyr's suggestion about adapting Arrow and Drill
> > vectors to work together (the same step is required to implement an Arrow
> > client, suggested by Paul).
> > I'm totally against the idea because it brings a huge amount of
> unnecessary
> > complexity just to uncover small insides into the integration. First is
> > that this is against the whole idea of Arrow since the main idea of Arrow
> > is to provide unified columnar memory layout between different tools
> > without any data conversions. But the step exactly requires data
> > conversions, at least our nullability vector and their validity bitmaps
> are
> > not the same, also Dict vector and their meaning of Dict may also cause
> > data conversion.
> > Another waste is the difference in metadata contracts, who knows whether
> > it's even possible to combine them. Another problem, like I already
> > mentioned is the huge complexity of the work,
> > To do the work I should overcome all underlying pitfalls of both
> projects,
> > in addition, I should cover all the untestable code with a comprehensive
> > amount of tests to show that back and forth conversion is done correctly
> > for every single unit of data in both vectors. The idea of adapters and
> > clients is about 4 years old or more and no one did practical work to
> > implement it. I think I explained why.
> >
> > What I really like in Volodymyr's and Paul's suggestions is that we can
> > extract clear API from existing EVF implementation and in practice
> provide
> > Arrow or any other implementation for it. Who knows, maybe with new
> > improved garbage collectors using direct memory is not necessary at all?
> It
> > is quite clear what we need the middle layer between operators and
> memory,
> > we need extensive benchmarks over the layer and experiments to show what
> is
> > the best underlying memory for Drill.
> >
> > What about client tools compatibility there is only one solution I can
> see
> > is to provide new clients for Drill 2.0, although I agree that this is a
> > tremendous amount of work there is no other way for making major steps
> into
> > the future. Without it, we should lay back and watch while Drill is
> slowly
> > dying and giving up to its competitors.
> >
> > NOTE: I want to encourage everyone to join the discussion and share
> vision
> > of what should be included in Drill 2.0 and what are strategic points we
> > want to achieve in the future.
> >
> > Kind regards,
> > Igor
> >
> >
> > On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Volodymyr,
> > >
> > > All good points. The Arrow/Drill conversion is a good option,
> especially
> > > for readers and clients. Between operators, such conversion is likely
> to
> > > introduce performance hits. As you know, the main feature that
> > > differentiates one query engine from another is performance, so adding
> > > conversions is unlikely to help Drill in the performance battle.
> > >
> > > Flatten should actually be pretty simple with EVF. Creating repeated
> > > values is much like filling in implicit columns: set a value, then
> "copy
> > > down" n times.
> > >
> > > Still, you raise good issues. Operators that fit your description are
> > > things like exchanges: these operators want to work at the low level of
> > > buffers. Not much the column readers/writers can do to help. And, as
> you
> > > point out, commercial components will be a challenge as Apache Drill
> does
> > > not maintain that code.
> > >
> > > Your larger point is valid: no matter how we approach it, moving to
> Arrow
> > > is a large project that will break compatibility.
> > >
> > > We've discussed a simple first step: Support an Arrow client to see if
> > > there is any interest. Support Arrow readers to see if that gives us
> any
> > > benefit. These are the visible tip of the iceberg. If we see
> advantages,
> > we
> > > can then think about changing the internals; the vast bulk of the
> iceberg
> > > which is below water and unseen.
> > >
> > > I think I disagree that we'd want to swap code that works directly with
> > > ValueVectors to code that works directly with ArrowVectors. Doing so
> > locks
> > > us into someone else's memory format and forces Drill to change every
> > time
> > > Arrow changes. Give the small size of the Drill team, and the frantic
> > pace
> > > of Arrow change, This was the team's concern early on and I'm still not
> > > convinced this is a good strategy.
> > >
> > > So, my original point remains: what is the benefit of all this cost? Is
> > > there another path that gives us greater benefit for the same or lesser
> > > cost? In short, what is our goal? Where do we want Drill to go?
> > >
> > > In fact, another radical suggestion is to embrace the wonderful work
> done
> > > on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support
> > for
> > > local files (Drill's unique strength), and embrace Presto's great
> support
> > > for data types, connectors, UDFs, clients and so on.
> > >
> > > As a team, we should ask the fundamental question: What benefits can
> > Drill
> > > offer that are not already offered by, say Presto or commercial Drill
> > > derivatives? If new can answer that question, we'll have a better idea
> > > about whether investment in Arrow will get us there.
> > >
> > > Or, are we better off to just leave well enough alone as we have done
> for
> > > several years?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >     On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi
> <
> > > volody...@apache.org> wrote:
> > >
> > >  Hi all,
> > >
> > > Glad to see that this discussion became active again!
> > >
> > > I have some comments regarding the steps for moving from Drill Vectors
> to
> > > Arrow Vectors.
> > >
> > > No doubt that using EVF for all operators and readers instead of value
> > > vectors will simplify things a lot.
> > > But considering the target goal - integration with Arrow, it may be the
> > > main show-stopper for it.
> > > There may be some operators which would be hard to adapt to use EVF,
> for
> > > example, I think Flatten operator will be among them since its
> > > implementation deeply connected with value vectors.
> > > Also, it requires moving all storage and format plugins to EVF, which
> > also
> > > may be problematic, for example, some plugins like MaprDB have specific
> > > features, and it should be considered when moving to EVF.
> > > Some other plugins are so obsolete, that I'm not sure that they still
> > work
> > > and that someone still uses it, so except moving to EVF, they should be
> > > resurrected to verify that they weren't broken more than before.
> > >
> > > This is a huge piece of work, and only after that, we will proceed with
> > the
> > > next step - integrating Arrow to EVF and then handling new
> Arrow-related
> > > issues for all the operators and readers at the same time.
> > >
> > > I propose to update these steps a little bit.
> > > 1. I agree that at first, we should extract EVF-related classes into a
> > > separate module.
> > > 2. But as the next step, I propose to extract EVF API which doesn't
> > depend
> > > on the vector implementation (Drill vectors, or Arrow ones).
> > > 3. After that, introduce module with Arrow which also implements this
> EVF
> > > API.
> > > 4. Introduce transformers that will be able to convert from Drill
> vectors
> > > into Arrow vectors and vice versa. These transformers may be
> implemented
> > to
> > > work using EVF abstractions instead of operating with specific vector
> > > implementations.
> > >
> > > 5.1. At this point, we can introduce Arrow connectors to fetch the data
> > in
> > > Arrow format or return it in such a format using transformers from step
> > 4.
> > >
> > > 5.2. Also, at this point, we may start rewriting operators to EVF and
> > > switching EVF implementation from the EVF based on Drill Vectors to the
> > > implementation which uses Arrow Vectors. Or switching implementations
> for
> > > existing EVF-based format plugins and fix newly discovered issues in
> > Arrow.
> > > Since at this point we will have operators which use Arrow format and
> > > operators which use Drill Vectors format, we should insert operators
> that
> > > transform one vector format to another introduced in step 4 between
> every
> > > pair of operators which returns batches in a different format.
> > >
> > > I know, that such an approach requires some additional work, like
> > > introducing transformers from step 4 and may cause some performance
> > > degradations for the case when format transformation is complex for
> some
> > > types and when we still have sequences of operators with different
> > formats.
> > >
> > > But with this approach, transitioning to Arrow wouldn't be blocked
> until
> > > everything is moved to EVF and it would be possible to transmit
> > > step-by-step, and Drill still will be able to switch between formats if
> > it
> > > would be required.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
> > >
> > >
> > > On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <
> ihor.huzenko....@gmail.com>
> > > wrote:
> > >
> > > > Hi Paul,
> > > >
> > > > Though I have very limited knowledge about Arrow at the moment, I can
> > > > highlight a few advantages of trying it:
> > > >  1. Allows fixing all the long-standing nullability issues and
> provide
> > > > better integration for storage plugins like Hive.
> > > >                https://jira.apache.org/jira/browse/DRILL-1344
> > > >                https://jira.apache.org/jira/browse/DRILL-3831
> > > >                https://jira.apache.org/jira/browse/DRILL-4824
> > > >                https://jira.apache.org/jira/browse/DRILL-7255
> > > >                https://jira.apache.org/jira/browse/DRILL-7366
> > > > 2. Some work was done by community to implement optimized Arrow
> readers
> > > for
> > > > Parquet and other formats&tools. We could try to adopt and check
> > whether
> > > we
> > > > can benefit from them.
> > > > 3. Since Arrow is under active development we could try their newest
> > > > features, like Flight which promises improved data transfers over the
> > > > network.
> > > >
> > > > Thanks,
> > > > Igor
> > > > On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers
> <par0...@yahoo.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > Before diving into design issues, it may be worthwhile to think
> about
> > > the
> > > > > premise: should Drill adopt Arrow as its internal memory layout?
> This
> > > is
> > > > > the question that the team has wrestled with since Arrow was
> > launched.
> > > > > Arrow has three parts. Let's think about each.
> > > > >
> > > > > First is a direct memory layout. The approach you suggest will let
> us
> > > > work
> > > > > with the Arrow memory format. Use EVF to access vectors, then the
> > > > > underlying vectors can be swapped from Drill to Arrow. But, what is
> > the
> > > > > advantage of using Arrow? The arrow layout isn't better than
> Drill's;
> > > it
> > > > is
> > > > > just different. Adopting the Arrow memory layout by itself provides
> > > > little
> > > > > benefit, but bit cost. This is one reason the team has been so
> > > reluctant
> > > > to
> > > > > atop Arrow.
> > > > >
> > > > > The only advantage of using the Arrow memory layout is if Drill
> could
> > > > > benefit from code written for Arrow. The second part of Arrow is a
> > set
> > > of
> > > > > modules to manipulate vectors. Gandiva is the most prominent
> example.
> > > > > However, there are major challenges. Most SQL operations are
> defined
> > to
> > > > > work on rows; some clever thinking will be needed to convert those
> > > > > operations into a series of column operations. (Drill's codegen is
> > NOT
> > > > > columnar: it works row-by-row.) So, if we want to benefit from
> > Gandiva,
> > > > we
> > > > > must completely rethink how we process batches.
> > > > >
> > > > > Is it worth doing all that work? The primary benefit would be
> > > > performance.
> > > > > But, it is not clear that our current implementation is the
> > bottleneck.
> > > > The
> > > > > current implementation is row-based, code generated in Java. Would
> be
> > > > great
> > > > > for someone to do some benchmarks to show the benefit from adopting
> > > > Gandiva
> > > > > to see if the potential gain justifies the likely large development
> > > cost.
> > > > >
> > > > > The third advantage of using Arrow is to allow exchange of vectors
> > > > between
> > > > > Drill and Arrow-based clients or readers. As it turns out, this is
> > not
> > > > the
> > > > > big win it seems. As we've discussed, we could easily create an
> > > > Arrow-based
> > > > > client for Drill -- there will be an RPC between the client and
> Drill
> > > and
> > > > > we can use that to do format conversion.
> > > > >
> > > > > For readers, Drill will want control over batch sizes; Drill cannot
> > > > > blindly accept whatever size vectors a reader chooses to produce.
> > (More
> > > > on
> > > > > that later.) Incoming data will be subject to projection and
> > selection,
> > > > so
> > > > > it will quickly move out of the incoming Arrow vectors into vector
> > > which
> > > > > Drill creates.
> > > > >
> > > > > Arrow gets (or got) a lot of press. However, our job is to focus on
> > > > what's
> > > > > best for Drill. There actually might be a memory layout for Drill
> > that
> > > is
> > > > > better than Arrow (and better than our current vectors.) A couple
> of
> > us
> > > > did
> > > > > a prototype some time ago that seemed to show promise. So, it is
> not
> > > > clear
> > > > > that adopting Arrow is necessarily a huge win: maybe it is, maybe
> > not.
> > > We
> > > > > need to figure it out.
> > > > >
> > > > > What IS clearly a huge win is the idea you outlined: creating a
> layer
> > > > > between memory layout and the rest of Drill so that we can try out
> > > > > different memory layouts to see what works best.
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > >
> > > > >
> > > > >    On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <
> > > > > ihor.huzenko....@gmail.com> wrote:
> > > > >
> > > > >  Hello Paul,
> > > > >
> > > > > I totally agree that integrating Arrow by simply replacing Vectors
> > > usage
> > > > > everywhere will cause a disaster.
> > > > > After the first look at the new *E*nhanced*V*ector*F*ramework and
> > based
> > > > on
> > > > > your suggestions I think I have an idea to share.
> > > > > In my opinion, the integration can be done in the two major stages:
> > > > >
> > > > > *1. Preparation Stage*
> > > > >      1.1 Extract all EVF and related components to a separate
> module.
> > > So
> > > > > the new separate module will depend only upon Vectors module.
> > > > >      1.2 Step-by-step rewriting of all operators to use a
> > higher-level
> > > > > EVF module and remove Vectors module from exec and modules
> > > dependencies.
> > > > >      1.3 Ensure that only module which depends on Vectors is the
> new
> > > EVF
> > > > > one.
> > > > > *2. Integration Stage*
> > > > >        2.1 Add dependency on Arrow Vectors module into EVF module.
> > > > >        2.2 Replace all usages of Drill Vectors & Protobuf Meta with
> > > > Arrow
> > > > > Vectors & Flatbuffers Meta in EVF module.
> > > > >        2.3 Finalize integration by removing Drill Vectors module
> > > > > completely.
> > > > >
> > > > >
> > > > > *NOTE:* I think that any way we won't preserve any backward
> > > compatibility
> > > > > for drivers and custom UDFs.
> > > > > And proposed changes are a major step forward to be included in
> Drill
> > > 2.0
> > > > > version.
> > > > >
> > > > >
> > > > > Below is the very first list of packages that in future may be
> > > > transformed
> > > > > into EVF module:
> > > > > *Module:* exec/Vectors
> > > > > *Packages:*
> > > > > org.apache.drill.exec.record.metadata - (An enhanced set of classes
> > to
> > > > > describe a Drill schema.)
> > > > > org.apache.drill.exec.record.metadata.schema.parser
> > > > >
> > > > > org.apache.drill.exec.vector.accessor - (JSON-like readers and
> > writers
> > > > for
> > > > > each kind of Drill vector.)
> > > > > org.apache.drill.exec.vector.accessor.convert
> > > > > org.apache.drill.exec.vector.accessor.impl
> > > > > org.apache.drill.exec.vector.accessor.reader
> > > > > org.apache.drill.exec.vector.accessor.writer
> > > > > org.apache.drill.exec.vector.accessor.writer.dummy
> > > > >
> > > > > *Module:* exec/Java Execution Engine
> > > > > *Packages:*
> > > > > org.apache.drill.exec.physical.rowSet - (Record batches management)
> > > > > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with
> > memory
> > > > > mgmt)
> > > > > org.apache.drill.exec.physical.impl.scan - (Row set based scan)
> > > > >
> > > > > Thanks,
> > > > > Igor Guzenko
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: About integration of drill and arrow

Reply via email to