Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage
everywhere would be a disaster.
After a first look at the new *Enhanced Vector Framework (EVF)* and based on
your suggestions, I have an idea to share.
In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*
       1.1 Extract the EVF and related components into a separate module, so
that the new module depends only upon the Vectors module.
       1.2 Rewrite all operators step by step to use the higher-level EVF
module, and remove the Vectors module from the dependencies of exec and
other modules (a rough sketch of the idea follows the list below).
       1.3 Ensure that the only module which depends on Vectors is the new
EVF one.
*2. Integration Stage*
        2.1 Add a dependency on the Arrow Vectors module to the EVF module.
        2.2 Replace all usages of Drill Vectors & Protobuf metadata with
Arrow Vectors & Flatbuffers metadata inside the EVF module.
        2.3 Finalize the integration by removing the Drill Vectors module
completely.
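
To illustrate what I mean in 1.2 and how stage 2 then becomes a local change,
here is a rough Java sketch. Everything in it (RowBatchWriter and the
"implementations" mentioned in the comments) is hypothetical and made up for
illustration only, it is not the current EVF API; the point is just that
operator code talks to a narrow writer abstraction, so the backing vectors
(Drill today, Arrow later) stay an internal detail of the EVF module:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction an operator would write rows through.
interface RowBatchWriter {
  void startRow();
  void setInt(String column, int value);
  void setString(String column, String value);
  void saveRow();
}

// Stand-in implementation that only collects rows in memory. In the real
// EVF module this would write into Drill value vectors (stage 1) and later
// into Arrow vectors (stage 2), without touching any operator code.
class InMemoryRowBatchWriter implements RowBatchWriter {
  private final List<Map<String, Object>> rows = new ArrayList<>();
  private Map<String, Object> current;

  @Override public void startRow() { current = new LinkedHashMap<>(); }
  @Override public void setInt(String column, int value) { current.put(column, value); }
  @Override public void setString(String column, String value) { current.put(column, value); }
  @Override public void saveRow() { rows.add(current); }

  List<Map<String, Object>> rows() { return rows; }
}

public class EvfSketch {
  // "Operator" code written only against the abstraction.
  static void writeSampleRows(RowBatchWriter writer) {
    for (int i = 0; i < 3; i++) {
      writer.startRow();
      writer.setInt("id", i);
      writer.setString("name", "row-" + i);
      writer.saveRow();
    }
  }

  public static void main(String[] args) {
    InMemoryRowBatchWriter writer = new InMemoryRowBatchWriter();
    writeSampleRows(writer);
    System.out.println(writer.rows());
  }
}

The real EVF readers/writers are of course much richer than this, but the
dependency direction is the whole point: after stage 1, only the EVF module
knows which vector library sits underneath.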


*NOTE:* I think that either way we won't be able to preserve backward
compatibility for drivers and custom UDFs, so the proposed changes are a
major step forward to be included in the Drill 2.0 release.


Below is a first list of packages that may in the future be moved into the
new EVF module (a small usage sketch follows the list):
*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to
describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser

org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batch management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
management)
org.apache.drill.exec.physical.impl.scan - (Row set based scan)
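
To give a feel for what these packages provide, here is a small sketch of
the schema metadata part. I'm assuming the SchemaBuilder / TupleMetadata /
ColumnMetadata classes from org.apache.drill.exec.record.metadata roughly as
they exist in master today; exact names and signatures may of course change
once the code moves into its own module:

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.ColumnMetadata;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class SchemaSketch {
  public static void main(String[] args) {
    // Describe a two-column schema with the enhanced metadata classes
    // (assumed current API; may differ after the module split).
    TupleMetadata schema = new SchemaBuilder()
        .add("id", MinorType.INT)
        .addNullable("name", MinorType.VARCHAR)
        .buildSchema();

    // Walk the schema. This metadata layer is what stage 2.2 would re-map
    // from the Protobuf-based types onto the Arrow/Flatbuffers ones.
    for (int i = 0; i < schema.size(); i++) {
      ColumnMetadata col = schema.metadata(i);
      System.out.println(col.name() + " : " + col.type());
    }
  }
}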

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned.) Last time I checked (more than a year ago), memory layout of
> Arrow is close to that in Drill; so conversion is around "packaging" and
> metadata, which can be encapsulated in an API.
>
> Converting internals is a major undertaking. We have large amounts of
> complex, critical code that works directly with the details of value
> vectors. My thought was to first convert code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable by
> itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul
>
>
>
>     On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <
> cgi...@gmail.com> wrote:
>
>  Hi Igor,
> That would be really great if you could see that through to completion.
> IMHO, the value from this is not so much performance related but rather the
> ability to use Drill to gather and prep data and seamlessly "hand it off"
> to other platforms for machine learning.
> -- C
>
>
> > On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com>
> wrote:
> >
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stages. Work has been progressing
> >> on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate. These
> >> include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source so that you
> >> could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill query
> >> so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of the Drill value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone to
> >> contribute an implementation. The third is a major long-term project
> >> because of the way Drill value vectors and Arrow vectors have diverged.
> >>
> >>
> >> I wonder, which of these use cases is of interest to you? How might you
> >> use that integration in your project?
> >>
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>    On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
> >> zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >>      As mentioned in Drill Developer Day 2018, there's a plan for Drill
> >> to integrate Arrow (Gandiva from Dremio). I was wondering how it is
> >> going.
> >>
> >>      Thanks in advance.
> >>
> >>
> >>
> >> Nai Yan
> >>
>
