Hi Igor,

You mentioned EVF. For those who are newer to the project, let me recap the 
history of EVF and how it fits into the Arrow picture.

The original idea of value vectors was to create long, contiguous blocks of 
data that we can load into the CPU cache, then operate on without cache 
misses. That is, load a vector of ints, then use the CPU's vector 
instructions to sum those ints.
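
As a toy illustration (plain Java, not Drill code), a tight loop over a 
contiguous block of ints keeps the cache warm and is easy for the JIT to 
auto-vectorize:

    // Toy example: summing a contiguous block of ints. Sequential
    // access avoids cache misses; the JIT can emit SIMD instructions.
    public class VectorSumSketch {
      static long sum(int[] values) {
        long total = 0;
        for (int v : values) {
          total += v;
        }
        return total;
      }

      public static void main(String[] args) {
        int[] vector = new int[4096];      // a 16 KB "vector" of ints
        java.util.Arrays.fill(vector, 1);
        System.out.println(sum(vector));   // 4096
      }
    }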

Somehow, this led to the belief that bigger vectors are better. Some years 
ago, we found we were creating vectors of 16 MB, 32 MB or larger. (In fact, 
there is no need for vectors larger than the CPU cache, but that is 
another discussion.)

Drill uses Netty for memory allocation. It turns out, however, that we run 
into memory fragmentation issues when vectors grow beyond 16 MB, so we need 
to limit vector (and thus batch) size. Record readers (part of the Scan 
operator), however, are in an awkward position: they have no idea of the 
size of the data they read. Most just read, say, 4K rows. 4K ints is 16 KB 
of memory, which is fine.

Maybe make the vectors bigger? Perhaps 64K rows of 4 bytes each: the vector 
is then 256 KB, about the size of the CPU cache. Perfect! But what if we are 
loading large text values? Maybe each row is 1 MB in size. In that case, we 
can fit only 16 (not 16K, just 16) VARCHARs into a 16 MB vector. And how the 
heck does the reader know that ahead of time?

The ResultSetLoader (part of EVF) is designed to handle this: it writes data 
until the vector fills, then does an "overflow shuffle" to hide all the messy 
details from the reader. The reader just asks "please sir, may I write 
another?" and the EVF answers either "yes" or "no."
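
To make this concrete, here is a rough sketch of the reader-side loop. 
Method and package names are from memory, so treat them as approximate; the 
next*() helpers are hypothetical stand-ins for whatever the format reader 
actually does:

    import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;

    // Sketch only: the reader never sizes vectors or counts bytes; it
    // just asks the loader, row by row, whether it may write another.
    void readBatch(ResultSetLoader rsLoader) {
      rsLoader.startBatch();
      RowSetLoader rowWriter = rsLoader.writer();
      while (!rowWriter.isFull() && sourceHasMoreRows()) {
        rowWriter.start();                      // begin a new row
        rowWriter.scalar("name").setString(nextName());
        rowWriter.scalar("amount").setInt(nextAmount());
        rowWriter.save();                       // commit the row
      }
      // rsLoader.harvest() then returns the completed batch; a row that
      // overflowed mid-write is quietly carried into the next batch.
    }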

To accomplish this, we ended up needing to wrap most vector functionality 
currently spread across generated code, the vectors themselves, accessors, 
mutators, complex writers, complex readers and more. All these operations 
need to occur in the column readers and writers because they all revolve 
around the current row index (or array index, when dealing with nested 
data).

BTW: EVF consists of two parts. The column writers (and the awkwardly named 
"ResultSetLoader") do the vector writing. The column readers handle access 
to existing vectors. The rest of EVF does scan-specific work such as 
projection, implicit columns, missing columns, data conversion, iterating 
over readers and more. Only the column readers and writers are relevant to 
operators other than Scan.
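
The reader side looks similar. Another sketch, with the same caveat about 
names being approximate:

    import org.apache.drill.exec.physical.rowSet.RowSetReader;

    // Sketch: walking an existing batch with the column readers. The
    // reader owns the current row index; per-column code just asks for
    // values by name.
    void printBatch(RowSetReader reader) {
      while (reader.next()) {                   // advance the row index
        String name = reader.scalar("name").getString();
        int amount = reader.scalar("amount").getInt();
        System.out.println(name + ": " + amount);
      }
    }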

So, the result of solving the memory fragmentation issue is that we now have 
a near-complete API on top of vectors, independent of both the memory layout 
and the value vector implementation. (There is a bit more work, in flight, 
to allow bulk copies.)

This is why our focus with EVF thus far has been on the readers: that is 
where the memory fragmentation issue is worst. CSV is done. JSON is in 
flight. Very glad to see Arina's work converting Avro, and Charles' work on 
his format and storage plugins. And, of course, your and others' help in 
reviewing the supporting PRs.

Others on the team created good interim solutions in the other operators via 
the "batch sizing" work. Still, the hope was that, eventually, the other 
operators would use the column writers for a unified, common solution.


So, this leads to your suggestions about the other operators. More on that 
later.

Thanks,
- Paul

 

    On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko 
<ihor.huzenko....@gmail.com> wrote:  
 
 Hello Paul,

I totally agree that integrating Arrow by simply replacing vector usage
everywhere would cause a disaster.
After a first look at the new Enhanced Vector Framework (EVF), and based on
your suggestions, I have an idea to share.
In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*
      1.1 Extract all EVF and related components into a separate module, so
that the new module depends only upon the Vectors module.
      1.2 Rewrite all operators, step by step, to use the higher-level EVF
module, and remove the Vectors module from the exec and other modules'
dependencies (see the sketch below).
      1.3 Ensure that the only module which depends on Vectors is the new
EVF one.
*2. Integration Stage*
        2.1 Add a dependency on the Arrow Vectors module to the EVF module.
        2.2 Replace all usages of Drill Vectors & Protobuf metadata with
Arrow Vectors & Flatbuffers metadata in the EVF module.
        2.3 Finalize the integration by removing the Drill Vectors module
completely.
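
To show what the rewriting in step 1.2 means in practice, here is a
schematic before/after. The names are illustrative only, not taken from any
specific operator:

    // Before: operator code tied to a concrete vector implementation.
    IntVector.Mutator mutator = amountVector.getMutator();
    mutator.setSafe(rowIndex, amount);

    // After: the same write through the EVF column writers. The vector
    // type, memory layout, and (later) Arrow stay behind the writer.
    rowWriter.scalar("amount").setInt(amount);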


*NOTE:* I think that either way we won't preserve backward compatibility
for drivers and custom UDFs, so the proposed changes are a major step
forward, to be included in Drill 2.0.


Below is a very first list of packages that may in future be moved into the
new EVF module (a sketch of their use follows the list):
*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to
describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser

org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batches management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
mgmt)
org.apache.drill.exec.physical.impl.scan - (Row set based scan)
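
For example, a rough sketch of the metadata API from the first package
(names taken from a quick look at master, so approximate):

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.exec.record.metadata.SchemaBuilder;
    import org.apache.drill.exec.record.metadata.TupleMetadata;

    // Sketch: describe a Drill schema directly, without first creating
    // vectors or a record batch.
    TupleMetadata schema = new SchemaBuilder()
        .add("name", MinorType.VARCHAR)
        .add("amount", MinorType.INT)
        .buildSchema();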

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned). Last time I checked (more than a year ago), the memory layout
> of Arrow is close to that of Drill, so conversion is mostly a matter of
> "packaging" and metadata, which can be encapsulated in an API.
>
> Converting internals is a major undertaking. We have large amounts of
> complex, critical code that works directly with the details of value
> vectors. My thought was to first convert code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable by
> itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul
>
>
>
>    On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <
> cgi...@gmail.com> wrote:
>
>  Hi Igor,
> That would be really great if you could see that through to completion.
> IMHO, the value from this is not so much performance related but rather the
> ability to use Drill to gather and prep data and seamlessly "hand it off"
> to other platforms for machine learning.
> -- C
>
>
> > On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com>
> wrote:
> >
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stages. Work has been progressing
> >> on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate. These
> >> include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source so that you
> >> could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill query
> >> so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of the Drill value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone to
> >> contribute an implementation. The third is a major long-term project
> >> because of the way Drill value vectors and Arrow vectors have diverged.
> >>
> >>
> >> I wonder, which of these use cases is of interest to you? How might you
> >> use that integration in your project?
> >>
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>    On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
> >> zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >>      As mentioned at Drill Developer Day 2018, there's a plan for Drill
> >> to integrate Arrow (Gandiva from Dremio). I was wondering how it is
> >> going.
> >>
> >>      Thanks in advance.
> >>
> >>
> >>
> >> Nai Yan
> >>
>