Hi Igor,

Thanks much for volunteering to create some POCs for our various options! It is 
not entirely obvious what we want to test, so let's think about it a bit.

We want to identify those areas that are either the biggest risk or benefit to 
performance. We want to do that without the cost of actually implementing the 
target solutions. Instead, we want to find clever ways to test specific 
components or scenarios.

Performance is notoriously hard to predict, but we have to start somewhere. One 
place to start is with our knowledge of where vectors have advantages (network 
transfers) and where they might have costs (partition exchanges, code gen.) 
It would be great if others could suggest areas based on their experience.

First, we don't need to test the parts that would be the same between two 
options. So, if both Drill vectors and Arrow are direct memory formats directly 
consumable by Netty, we probably don't need to compare network performance 
between the two options.

On the other hand, a plain Java solution would require 
serialization/deserialization of objects, so we'd want to look into the many 
serialization options available and compare them with, say, Value Vectors. We 
could, say, create a simple client and server that transfers batches of records 
of various widths so we can see the cost of serialization. It could be that 
Netty isn't even the right choice for plain Java, perhaps we want to compare 
Value Vectors/Netty with plain Java/gRPC (or whatever is the most 
state-of-the-art choice.) We might look at what Presto and Spark do in order to 
get some hints.
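
To make that concrete, here is roughly the throwaway test I have in mind for the 
serialization piece: build a batch of row objects, push it through plain JDK 
serialization, and time both directions. (The Row class and batch size below are 
made up, and a real run would use JMH and also try Kryo, protobuf/gRPC and 
friends; this just shows the shape of the measurement.)

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical row shape: a few fixed-width columns plus a string.
class Row implements Serializable {
    private static final long serialVersionUID = 1L;
    final int a;
    final long b;
    final double c;
    final String d;

    Row(int a, long b, double c, String d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }
}

public class SerializationCostSketch {
    public static void main(String[] args) throws Exception {
        final int batchSize = 64_000;            // rows per batch, pick to taste
        List<Row> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(new Row(i, i * 10L, i * 0.5, "row-" + i));
        }

        // Serialize the batch with plain JDK serialization as a baseline.
        long start = System.nanoTime();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(batch);
        }
        long serializeNanos = System.nanoTime() - start;

        // Deserialize it again to capture the receive-side cost.
        start = System.nanoTime();
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject();
        }
        long deserializeNanos = System.nanoTime() - start;

        System.out.printf("bytes=%d serialize=%.1f ms deserialize=%.1f ms%n",
            bytes.size(), serializeNanos / 1e6, deserializeNanos / 1e6);
    }
}

Repeating that with batches of various widths, and with the other serializers, 
should give us the per-row serialization cost curve we're after.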

Next, we'd want to know how much faster Arrow is compared to value vectors for 
common operations. Maybe simulate writing, then reading, batches of records of 
various sizes. For apples-to-apples comparisons, both can be done at the vector 
API level. Because Drill is an SQL engine, maybe write record-by-record. If 
Arrow has optimizations for certain operations (filters maybe? WHERE x = 10, 
say), try the Drill approach (row-by-row, read vector x, compare with 10 and 
set an offset vector). Compare that with whatever Arrow does (LLVM code gen?) We 
could then do the same operations with plain Java, expanding on the little 
prototype we just did.
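
For the filter case, the row-by-row loop is simple enough to mock with plain 
arrays standing in for the vectors; the same loop shape applies whether x lives 
in a Drill value vector, an Arrow vector, or a field of a plain Java object 
(array sizes and data below are made up):

import java.util.Random;

public class FilterSketch {
    public static void main(String[] args) {
        final int rowCount = 1_000_000;
        int[] x = new int[rowCount];
        Random random = new Random(42);
        for (int i = 0; i < rowCount; i++) {
            x[i] = random.nextInt(100);
        }

        // Drill-style WHERE x = 10: record the indexes of the rows that pass
        // in a selection vector rather than copying any data.
        int[] selection = new int[rowCount];
        long start = System.nanoTime();
        int selected = 0;
        for (int row = 0; row < rowCount; row++) {
            if (x[row] == 10) {
                selection[selected++] = row;
            }
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("selected %d of %d rows in %.2f ms%n",
            selected, rowCount, elapsed / 1e6);
    }
}

The value vector and Arrow variants would replace the int[] reads with the 
respective vector accessors, which is exactly the overhead we want to see.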

Third, we should consider code gen. We can assume that Java code gen for Arrow 
will be substantially similar to that for value vectors (though the detailed 
holders, readers and the rest will differ.) If we choose to code gen to the 
column accessor API, then code gen would ideally be nearly identical for either 
implementation, and we would not need to run tests.

However, if we used plain Java, code gen would be far simpler (just directly 
access class fields). So, we could maybe take a sizable chunk of generated code 
(maybe for a TPC-H query Project operation, that has a number of calculations), 
and we could hand-write the plain Java equivalent. (This is not as hard as it 
sounds; most of our generated code is boilerplate not needed for plain Java.) 
Then we could compare compile performance (compiler + byte code fix-up for 
value vectors, compile only for Java), along with runtime performance (run the 
code against, say, 100 canned, well-known batches.)
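
To give a feel for why hand-writing the plain Java side is not much work: a 
TPC-H-style Project over, say, lineitem collapses to something like the sketch 
below (the row class is hypothetical; the point is that there are no holders, 
readers or vector accessors left to generate):

// Hypothetical plain-Java row holding a few TPC-H lineitem columns.
class LineItemRow {
    double extendedPrice;
    double discount;
    double tax;
    double discPrice;    // computed: extendedprice * (1 - discount)
    double charge;       // computed: discprice * (1 + tax)
}

public class ProjectSketch {
    // The whole "generated" Project becomes a tight loop over plain fields,
    // with nothing left for a byte-code fix-up pass to do.
    static void project(LineItemRow[] batch) {
        for (LineItemRow row : batch) {
            row.discPrice = row.extendedPrice * (1 - row.discount);
            row.charge = row.discPrice * (1 + row.tax);
        }
    }
}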

If Arrow can do low-level code gen, we can perhaps compare the performance of 
an Arrow implementation (with its code gen added to the Drill code gen) against 
the value vector version.

Fourth, we want to understand how the memory allocators work under load. For 
example, we might simulate a loaded Drillbit that is doing, say, buffered 
exchanges, hash joins or sorts. A good simulation might be to create, buffer, 
read and release many batches concurrently in many threads. This will tell us 
how well the memory allocators behave when memory is heavily utilized (see how 
close to 100% usage we can get) and concurrency is high (at what thread count 
does throughput start dropping due to memory manager contention?)
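
A skeleton for that load test might look like the following. Here a "batch" is 
just a direct ByteBuffer and releasing it means dropping the reference for the 
GC to reclaim; the real variants would swap in the Drill allocator, the Arrow 
allocator and plain heap objects, which is exactly the behavior we want to 
compare (the sizes and counts below are arbitrary):

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AllocatorLoadSketch {
    public static void main(String[] args) throws Exception {
        final int threads = Runtime.getRuntime().availableProcessors() * 3;
        final int batchBytes = 256 * 1024;       // per simulated batch
        final int batchesPerThread = 1_000;
        final AtomicLong totalBatches = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < batchesPerThread; i++) {
                    // "Create" and "buffer" a batch, touch it, then drop it.
                    ByteBuffer batch = ByteBuffer.allocateDirect(batchBytes);
                    batch.putLong(0, i);
                    totalBatches.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d threads, %.0f batches/sec%n",
            threads, totalBatches.get() / seconds);
    }
}

Varying the thread count and batch size, and watching how close we can push 
total allocation to the configured limit, should expose any contention in the 
allocators.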

More broadly, we might simulate complex operations such as sort or join. As 
above, we don't need the full operation: we just identify the most expensive 
part (copying rows, say, or simulating a selection vector remover.)
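
For the selection vector remover case, the expensive core is just the scatter 
copy, so a small sketch like this (again with plain arrays standing in for the 
vectors) should be enough to compare the memory layouts:

public class SvRemoverSketch {

    // Copy only the selected rows into a dense output column.
    static int[] compact(int[] column, int[] selection, int selectedCount) {
        int[] out = new int[selectedCount];
        for (int i = 0; i < selectedCount; i++) {
            out[i] = column[selection[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        final int rowCount = 1_000_000;
        int[] column = new int[rowCount];
        int[] selection = new int[rowCount];
        int selected = 0;
        for (int row = 0; row < rowCount; row++) {
            column[row] = row;
            if (row % 3 == 0) {              // keep every third row
                selection[selected++] = row;
            }
        }
        long start = System.nanoTime();
        int[] dense = compact(column, selection, selected);
        System.out.printf("copied %d rows in %.2f ms%n",
            dense.length, (System.nanoTime() - start) / 1e6);
    }
}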

Fifth, and probably more difficult, would be to understand the impact on 
network exchanges. Simulate exchanges on a large cluster (the issue I mentioned 
previously). How much memory and processing cost is needed for the vector-based 
slice & buffer approach we use today? How might that change if we use plain Java and 
can just shuffle & send individual Java rows? We don't actually need a large 
cluster: we could use a smaller cluster and tell Drill to create 3x or 5x 
fragments compared to CPUs. (Or, even better: a whole bunch of this kind of 
testing was done by the previous Drill team at MapR; it would be great if we 
could dig up those results.)

Another possible approach, if someone has access to a decent test cluster, is 
to run the TPC-H queries on both Drill and Presto. Presto appears to mostly use 
a heap-memory approach, so we could get a crude comparison there. If anyone 
knows of an Arrow-based, open source tool that can do TPC-H, we could compare 
that to Drill's value-vector approach to get another crude estimate. Such tests 
are not ideal, since they will also measure things like Parquet read efficiency, 
DAG pipelining and the rest. But they would provide useful data points.

With these basics available, we can try variations. For example, I can try to 
find the branch for that fixed-size-block prototype we did a few years back and 
we can compare that to Arrow and value vectors. Plus whatever else we think up.

That's a quick initial outline of what we could do. I'm sure folks will suggest 
ways to simplify and sharpen the set of tests. In any event, a very good first 
step is to get familiar with Arrow to suggest tests that would best demonstrate 
its advantages.

The ideal result out of all this would be to learn that 1) the difference in 
performance between the options is too small to justify the cost of making a 
switch, or 2) one of the options is so obviously better that it is clear what 
we should do. Of course, the real answer is likely to be somewhere in the 
middle.

Suggestions?

Thanks,
- Paul

 

    On Monday, January 13, 2020, 10:52:14 AM PST, Igor Guzenko 
<ihor.huzenko....@gmail.com> wrote:  
 
 Hi Paul and Volodymyr,

Thank you very much, Volodymyr and Paul, for defining a good migration
strategy. It really should work for a smooth migration.

What I also really like about the discussion is that some excellent questions
appeared:
  - Aren't we just suffering from premature optimizations?
  - Was the whole vectors-and-buffers complexity actually required to
achieve the performance improvements?
  - What would things look like without the complex codegen? Is it possible
that the JIT would let us avoid it?

I think we finally defined the 3 options to compare:
  - plain Java with a few different GCs optimized for large heap sizes
  - Arrow vectors
  - fixed-size Drill vectors.

So as the first step, I want to develop POC and compare these options.
Since you're much more experienced with Drill than I am, could you please help
me define the minimal set of operators to compare?

Thanks,
Igor

  
