Hi Igor,

Before diving into design issues, it may be worthwhile to think about the 
premise: should Drill adopt Arrow as its internal memory layout? This is the 
question that the team has wrestled with since Arrow was launched. Arrow has 
three parts. Let's think about each.

First is the direct memory layout. The approach you suggest would let us work 
with the Arrow memory format: use EVF to access vectors, and the underlying 
vectors can then be swapped from Drill to Arrow. But what is the advantage of 
using Arrow? The Arrow layout isn't better than Drill's; it is just different. 
Adopting the Arrow memory layout by itself provides little benefit, but at 
significant cost. This is one reason the team has been so reluctant to adopt 
Arrow.
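To make that concrete, here is a minimal sketch of what operator code looks 
like when written against EVF (assuming a ResultSetLoader built with an 
(id, name) schema; the classes are the existing EVF ones). Note that no 
concrete vector class appears, which is exactly what would let us swap the 
layout underneath:

    import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;

    public class EvfSketch {
      public void writeBatch(ResultSetLoader rsLoader) {
        RowSetLoader writer = rsLoader.writer();
        rsLoader.startBatch();
        for (int i = 0; i < 100 && !writer.isFull(); i++) {
          writer.start();
          writer.scalar("id").setInt(i);             // no vector types in sight
          writer.scalar("name").setString("row " + i);
          writer.save();
        }
        rsLoader.harvest();  // today a Drill VectorContainer; tomorrow, maybe Arrow
      }
    }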

The only compelling reason to adopt the Arrow memory layout is if Drill could 
benefit from code written for Arrow. The second part of Arrow is a set of modules to
manipulate vectors. Gandiva is the most prominent example. However, there are 
major challenges. Most SQL operations are defined to work on rows; some clever 
thinking will be needed to convert those operations into a series of column 
operations. (Drill's codegen is NOT columnar: it works row-by-row.) So, if we 
want to benefit from Gandiva, we must completely rethink how we process batches.
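To illustrate the rethink with a toy example: today's generated code evaluates 
a whole expression per row, while a Gandiva-style engine evaluates one kernel 
per operator across the whole column. A hand-rolled sketch of the two shapes 
(plain arrays stand in for vectors; the expression is a[i] + b[i] > 10):

    public class FilterShapes {
      // Row-at-a-time: the shape Drill's generated code has today.
      static boolean[] filterRowWise(int[] a, int[] b) {
        boolean[] keep = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
          keep[i] = a[i] + b[i] > 10;        // whole expression per row
        }
        return keep;
      }

      // Column-at-a-time: the shape Gandiva wants. One kernel per operator,
      // with an intermediate column between the kernels.
      static boolean[] filterColumnWise(int[] a, int[] b) {
        int[] sum = new int[a.length];
        for (int i = 0; i < a.length; i++) {
          sum[i] = a[i] + b[i];              // kernel 1: add
        }
        boolean[] keep = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
          keep[i] = sum[i] > 10;             // kernel 2: compare
        }
        return keep;
      }
    }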

Is it worth doing all that work? The primary benefit would be performance. But 
it is not clear that our current implementation -- row-based, with code 
generated in Java -- is the bottleneck. It would be great for someone to run 
benchmarks comparing it against Gandiva to see whether the potential gain 
justifies the likely large development cost.
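Even a crude harness would give us a first signal; a real study should use 
JMH. A sketch of the shape of the measurement, reusing the two filter methods 
above:

    static long avgNanos(Runnable op, int reps) {
      op.run();                              // warm-up pass for the JIT
      long start = System.nanoTime();
      for (int i = 0; i < reps; i++) {
        op.run();
      }
      return (System.nanoTime() - start) / reps;
    }

    // Usage: compare the two shapes on the same data, e.g.
    //   long rowWise = avgNanos(() -> FilterShapes.filterRowWise(a, b), 1000);
    //   long colWise = avgNanos(() -> FilterShapes.filterColumnWise(a, b), 1000);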

The third part of Arrow, and its other potential advantage, is the exchange of 
vectors between Drill and Arrow-based clients or readers. As it turns out, this 
is not the big win it seems. As we've discussed, we could easily create an 
Arrow-based client for Drill -- there is an RPC boundary between the client and 
Drill, and we can use it to do the format conversion.
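For example, the server side of such a client could copy each outgoing Drill 
vector into an Arrow vector at the RPC boundary: a per-batch copy, not a change 
to Drill's internals. A rough sketch (Drill and Arrow each have an IntVector 
class, in different packages; null handling and error cases are omitted):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;

    static IntVector toArrow(org.apache.drill.exec.vector.IntVector drillVec,
                             BufferAllocator allocator) {
      int n = drillVec.getAccessor().getValueCount();
      IntVector arrowVec = new IntVector("col", allocator);
      arrowVec.allocateNew(n);
      for (int i = 0; i < n; i++) {
        arrowVec.set(i, drillVec.getAccessor().get(i));  // value-by-value copy
      }
      arrowVec.setValueCount(n);
      return arrowVec;
    }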

For readers, Drill will want control over batch sizes; Drill cannot blindly 
accept whatever size vectors a reader chooses to produce. (More on that later.) 
Incoming data will be subject to projection and selection, so it will quickly 
move out of the incoming Arrow vectors into vectors which Drill creates.
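Concretely, a reader adapter might drain an incoming Arrow vector into Drill's 
own batches, with EVF enforcing the row limit. A sketch (the helper is 
hypothetical; VarCharVector is Arrow's, RowSetLoader is EVF's):

    import org.apache.arrow.vector.VarCharVector;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;

    // Copy from an Arrow vector of whatever size the reader produced into
    // the current Drill batch; stop when Drill says the batch is full.
    // Returns the first uncopied row so the caller can resume next batch.
    static int copyInto(RowSetLoader writer, VarCharVector source, int from) {
      int i = from;
      while (i < source.getValueCount() && !writer.isFull()) {
        writer.start();
        if (!source.isNull(i)) {
          byte[] value = source.get(i);
          writer.scalar("name").setBytes(value, value.length);
        }
        writer.save();
        i++;
      }
      return i;
    }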

Arrow gets (or got) a lot of press. However, our job is to focus on what's best 
for Drill. There actually might be a memory layout for Drill that is better 
than Arrow (and better than our current vectors). A couple of us did a 
prototype some time ago that seemed to show promise. So, it is not clear that 
adopting Arrow is necessarily a huge win: maybe it is, maybe not. We need to 
figure it out.

What IS clearly a huge win is the idea you outlined: creating a layer between 
memory layout and the rest of Drill so that we can try out different memory 
layouts to see what works best.
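Reduced to an interface, that layer might look something like this (the names 
are illustrative only; ScalarReader and ScalarWriter are the existing EVF 
accessor types):

    import org.apache.drill.exec.vector.accessor.ScalarReader;
    import org.apache.drill.exec.vector.accessor.ScalarWriter;

    // Operators would code against this; Drill vectors, Arrow vectors, or
    // some future layout would each be one implementation behind it.
    public interface ColumnLayout {
      ScalarWriter writerFor(String column);
      ScalarReader readerFor(String column);
      int rowCount();
    }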

Thanks,
- Paul

 

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko
<ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage
everywhere would cause a disaster.
After a first look at the new Enhanced Vector Framework (EVF), and based on
your suggestions, I think I have an idea to share.
In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*
      1.1 Extract all EVF and related components into a separate module, so
that the new module depends only upon the Vectors module.
      1.2 Step-by-step, rewrite all operators to use the higher-level EVF
module, and remove the Vectors module from the dependencies of exec and other
modules (see the sketch after this list).
      1.3 Ensure that the only module which depends on Vectors is the new EVF
one.
*2. Integration Stage*
        2.1 Add a dependency on the Arrow Vectors module to the EVF module.
        2.2 Replace all usages of Drill Vectors & Protobuf meta with Arrow
Vectors & Flatbuffers meta in the EVF module.
        2.3 Finalize the integration by removing the Drill Vectors module
completely.
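To make step 1.2 concrete, the rewrite I mean is mechanical, of this kind 
(before/after is illustrative, not real operator code; RowSetReader is from 
the rowSet package listed below):

    // Before: direct vector access ties the operator to Drill's layout.
    //   IntVector v = container.getValueAccessorById(IntVector.class, 0)
    //       .getValueVector();
    //   int value = v.getAccessor().get(rowIndex);

    // After: EVF accessors only; no vector classes imported.
    import org.apache.drill.exec.physical.rowSet.RowSetReader;

    static int readValue(RowSetReader reader) {
      return reader.scalar("id").getInt();   // current row, column by name
    }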


*NOTE:* I think that, either way, we won't preserve backward compatibility
for drivers and custom UDFs. The proposed changes are a major step forward,
to be included in the Drill 2.0 release.


Below is an initial list of packages that may in future be moved into the
EVF module:
*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to
describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser

org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batches management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
management)
org.apache.drill.exec.physical.impl.scan - (Row set based scan)

Thanks,
Igor Guzenko
