Hi Igor,

Bingo! You clearly explained the idea that has been simmering since I started on the "Row Set" framework (which evolved into EVF). Some of the early ideas are in [1]. At one of the Drill Developer Days, there was a brief discussion of the approach you propose: create an API on top of our memory layout, modify operators to use that API, then change the underlying memory representation. You've done a nice job spelling out some of the steps involved.
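The "API on top of our memory layout" idea can be sketched roughly as follows. This is a toy illustration only: all names are invented, and Drill's actual EVF column writers are far richer. The point is simply that operator code written against the interface never touches buffers, so the backing store can change underneath it.

```java
// Hypothetical sketch (invented names; not Drill's actual EVF API):
// a writer interface that hides the memory layout from operator code.
public class LayoutHidingApiSketch {

  /** Consumers write values through this interface and never see buffers. */
  interface IntColumnWriter {
    void setInt(int rowIndex, int value);
    int rowCount();
  }

  /** Toy backing: a plain heap array standing in for a value vector. */
  static class HeapBackedWriter implements IntColumnWriter {
    private final int[] values;
    private int rows;

    HeapBackedWriter(int capacity) { values = new int[capacity]; }

    @Override public void setInt(int rowIndex, int value) {
      values[rowIndex] = value;
      rows = Math.max(rows, rowIndex + 1);
    }

    @Override public int rowCount() { return rows; }
  }

  /** "Operator" code: written once, against the interface only. */
  static int fillSequence(IntColumnWriter writer, int n) {
    for (int i = 0; i < n; i++) {
      writer.setInt(i, i * 2);
    }
    return writer.rowCount();
  }

  public static void main(String[] args) {
    IntColumnWriter writer = new HeapBackedWriter(16);
    System.out.println("rows written: " + fillSequence(writer, 10));
  }
}
```

Swapping `HeapBackedWriter` for, say, an Arrow-backed implementation would leave `fillSequence` untouched, which is the whole migration bet.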
Though this is a new idea in Drill, it is not a unique concept in the industry. As it turns out, Presto already does this. Presto is row-based, with rows grouped into batches. (By contrast, Drill is column-based, with columns grouped into batches.) Presto supports multiple row formats: as Java objects and, it seems, as arrays or direct memory. The point is, Presto provides a row access API that hides the memory layout from consumers.

Drill, being vector-based, is far more complex than Presto, however. Vectors have to be pre-allocated, and decisions made in one vector need to be reflected in the others. (That is, if we have two columns, image ID (int) and image data (varbinary), the image data column will get huge fast, but the image ID will stay small.) Column-based memory layout is Drill's big selling point, but it leads to code that is VERY complex compared to the row-based code in Presto. Thus, the Drill vector API ends up being more complicated than Presto's row-based API. As you noted, the column readers and writers (part of EVF) try to handle this complexity.

More to say; will follow up in separate notes. Perhaps we can create a document somewhere to capture these ideas. (Markdown in the Wiki attached to the Drill project is one possibility; easy to track and share.) Then, if the team agrees, we can coordinate the incremental efforts done thus far.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/BH-Future-Work

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere would cause a disaster. After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*

1.1 Extract all EVF and related components to a separate module.
So the new separate module will depend only upon the Vectors module.
1.2 Step by step, rewrite all operators to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules.
1.3 Ensure that the only module which depends on Vectors is the new EVF one.

*2. Integration Stage*

2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf meta with Arrow Vectors & Flatbuffers meta in the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that, either way, we won't preserve backward compatibility for drivers and custom UDFs, so the proposed changes are a major step forward to be included in the Drill 2.0 version.

Below is a very first list of packages that may in future be moved into the EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batch management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory management)
org.apache.drill.exec.physical.impl.scan - (Row-set-based scan)

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned.)
> Last time I checked (more than a year ago), the memory layout of Arrow
> was close to that in Drill; so conversion is mostly a matter of
> "packaging" and metadata, which can be encapsulated in an API.
>
> Converting the internals is a major undertaking. We have large amounts
> of complex, critical code that works directly with the details of value
> vectors. My thought was to first convert that code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable
> by itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul