Hi Igor, Thanks much for volunteering to create some POCs for our various options! It is not entirely obvious what we want to test, so let's think about it a bit.
We want to identify those areas that pose either the biggest risk or the biggest benefit to performance, and to do so without the cost of actually implementing the target solutions. Instead, we want to find clever ways to test specific components or scenarios. Performance is notoriously hard to predict, but we have to start somewhere. One place to start is with our knowledge of where vectors have advantages (network transfers) and where they might have costs (partition exchanges, code gen.) It would be great if others could suggest areas based on their experience.

First, we don't need to test the parts that would be the same between two options. So, if both Drill vectors and Arrow are direct-memory formats directly consumable by Netty, we probably don't need to compare network performance between those two options. On the other hand, a plain-Java solution would require serialization/deserialization of objects, so we'd want to look into the many serialization options available and compare them with, say, value vectors. We could, say, create a simple client and server that transfers batches of records of various widths so we can see the cost of serialization. It could be that Netty isn't even the right choice for plain Java; perhaps we want to compare value vectors/Netty with plain Java/gRPC (or whatever is the most state-of-the-art choice.) We might look at what Presto and Spark do in order to get some hints.

Second, we'd want to know how much faster Arrow is compared to value vectors for common operations. Maybe simulate writing, then reading, batches of records of various sizes. For apples-to-apples comparisons, both can be done at the vector API level. Because Drill is an SQL engine, maybe write record-by-record. If Arrow has optimizations for certain operations (filters, maybe? WHERE x = 10, say), try the Drill approach (row-by-row: read vector x, compare with 10, and set a selection vector) and compare that with whatever Arrow does (LLVM code gen?)
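To make the row-by-row filter idea concrete, here is a minimal sketch in plain Java. This is a hypothetical micro-benchmark skeleton, not Drill code: the `FilterSketch` class, the `filter` method, and the use of a plain `int[]` as the selection vector are illustrative stand-ins for Drill's vector and SV2 APIs.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of a row-by-row filter (WHERE x = 10) that reads a
 *  column and writes matching row indexes into a selection vector, roughly
 *  as Drill's Filter operator plus SV2 work conceptually. */
public class FilterSketch {

    /** Scan the column, recording the index of each row where value == target.
     *  Returns the number of selected rows; 'selection' stands in for
     *  Drill's selection vector. */
    public static int filter(int[] column, int target, int[] selection) {
        int selected = 0;
        for (int row = 0; row < column.length; row++) {
            if (column[row] == target) {
                selection[selected++] = row;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        int batchSize = 1 << 16;  // 64K rows, Drill's maximum batch row count
        int[] column = new int[batchSize];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < batchSize; i++) {
            column[i] = rnd.nextInt(0, 100);
        }
        int[] selection = new int[batchSize];
        long start = System.nanoTime();
        int selected = filter(column, 10, selection);
        long elapsed = System.nanoTime() - start;
        System.out.println("Selected " + selected + " of " + batchSize
            + " rows in " + elapsed + " ns");
    }
}
```

The same loop re-expressed against the value-vector, Arrow, and column-accessor APIs would give the apples-to-apples comparison described above.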
We could then do the same operations with plain Java, expanding on the little prototype we just did.

Third, we should consider code gen. We can assume that Java code gen for Arrow will be substantially similar to that for value vectors (though the detailed holders, readers and the rest will differ.) If we choose to code gen to the column accessor API, then code gen would ideally be nearly identical for either implementation, and we would not need to run tests. However, if we used plain Java, code gen would be far simpler (just directly access class fields.) So, we could take a sizable chunk of generated code (maybe the Project operator for a TPC-H query with a number of calculations) and hand-write the plain-Java equivalent. (This is not as hard as it sounds; most of our generated code is boilerplate not needed for plain Java.) Then we could compare compile performance (compiler plus byte-code fix-up for value vectors, compile only for plain Java) along with runtime performance (run the code against, say, 100 canned, well-known batches.) If Arrow can do low-level code gen, we can perhaps compare the performance of an Arrow implementation (with its code gen added to the Drill code gen) against the value-vector version.

Fourth, we want to understand how the memory allocators work under load. For example, we might simulate a loaded Drillbit that is doing, say, buffered exchanges, hash joins or sorts. A good simulation might be to create, buffer, read and release many batches concurrently in many threads. This will tell us how well the memory allocators behave when memory is heavily utilized (how close to 100% usage can we get?) and concurrency is high (at what thread count does throughput start dropping due to memory-manager contention?) More broadly, we might simulate complex operations such as sort or join. As above, we don't need the full operation; we just identify the most expensive part (copying rows, say, or simulating a selection vector remover.)
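The concurrent create/buffer/read/release test could start from a skeleton like the one below. This is an assumption-laden sketch: it uses plain java.nio direct buffers as a stand-in for Drill's or Arrow's allocator, so a real test would swap in the actual allocator and track peak usage; the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stress-test sketch: many threads concurrently allocate,
 *  write, read, and release batch-sized direct buffers to approximate
 *  batch churn in a loaded Drillbit. */
public class AllocatorStress {

    /** Run 'threads' workers, each doing 'iterations' allocate/touch/release
     *  cycles of 'batchBytes'-sized buffers; returns total batches processed. */
    public static long run(int threads, int iterations, int batchBytes) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong batches = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterations; i++) {
                    ByteBuffer buf = ByteBuffer.allocateDirect(batchBytes);
                    buf.putLong(0, i);            // "write" the batch
                    long check = buf.getLong(0);  // "read" it back
                    if (check != i) throw new IllegalStateException();
                    batches.incrementAndGet();
                    // direct buffer is released once it becomes unreachable
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batches.get();
    }

    public static void main(String[] args) {
        // Sweep the thread count to find where throughput starts dropping.
        for (int threads : new int[] {1, 2, 4, 8}) {
            long start = System.nanoTime();
            long done = run(threads, 1_000, 256 * 1024);
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d threads: %.0f batches/sec%n", threads, done / secs);
        }
    }
}
```

Sweeping the thread count, as in `main`, is what would expose the contention knee described above; sweeping total allocation toward the direct-memory limit would expose behavior near 100% usage.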
Fifth, and probably more difficult, would be to understand the impact on network exchanges. Simulate exchanges on a large cluster (the issue I mentioned previously.) How much memory and cost is needed for the vector-based slice-and-buffer approach we use today? How might that change if we use plain Java and can just shuffle and send individual Java rows? We don't actually need a large cluster: we could use a smaller cluster and tell Drill to create 3x or 5x fragments compared to CPUs. (Or, even better, a whole bunch of this kind of testing was done by the previous Drill team at MapR; it would be great if we could dig up those results.) Another possible approach, if someone has access to a decent test cluster, is to run the TPC-H queries on both Drill and Presto. Presto appears to mostly use a heap-memory approach, so we could get a crude comparison there. If anyone knows of an Arrow-based, open source tool that can run TPC-H, we could compare that to Drill's value-vector approach to get another crude estimate. Such tests are not ideal, since they will also measure things like Parquet read efficiency, DAG pipelining and the rest. But they would provide useful data points.

With these basics available, we can try variations. For example, I can try to find the branch for that fixed-size-block prototype we did a few years back and compare that to Arrow and value vectors. Plus whatever else we think up.

That's a quick initial outline of what we could do. I'm sure folks will suggest ways to simplify and sharpen the set of tests. In any event, a very good first step is to get familiar with Arrow to suggest tests that would best demonstrate its advantages. The ideal result out of all this would be to learn that 1) the difference in performance between the options is too small to justify the cost of making a switch, or 2) one of the options is so obviously better that it is clear what we should do. Of course, the real answer is likely to be somewhere in the middle. Suggestions?
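Measuring the per-row cost of shipping individual plain-Java rows (as opposed to slicing vector buffers) could start with something as simple as the sketch below. The `Row` class, its fields, and the use of built-in Java serialization are all assumptions for illustration; a real comparison would also try Kryo, protobuf, or gRPC encodings and set the results against sending a value-vector buffer directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical sketch: serialize a batch of plain-Java row objects with
 *  standard Java serialization and report the bytes produced per row. */
public class RowSerializationSketch {

    /** A plain-Java "row" of mixed column widths, like a narrow record. */
    public static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final double amount;
        final String name;
        Row(int id, double amount, String name) {
            this.id = id; this.amount = amount; this.name = name;
        }
    }

    /** Serialize 'rowCount' rows and return the serialized size in bytes. */
    public static int serializedSize(int rowCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < rowCount; i++) {
                out.writeObject(new Row(i, i * 1.5, "name-" + i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // cannot occur for in-memory streams
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        for (int rows : new int[] {1_000, 10_000, 100_000}) {
            int size = serializedSize(rows);
            System.out.printf("%,d rows -> %,d bytes (%.1f bytes/row)%n",
                rows, size, (double) size / rows);
        }
    }
}
```

Timing the same loop, and pointing the stream at a socket instead of a byte array, would turn this into the client/server transfer test mentioned earlier.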
Thanks,
- Paul

On Monday, January 13, 2020, 10:52:14 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hi Paul and Volodymyr,

Thank you very much Volodymyr and Paul for defining the good migration strategy. It really should work for a smooth migration. What I also really like in the discussion is that excellent questions appeared:
- Aren't we just suffering from premature optimizations?
- Were the whole vectors and buffers' complexity actually required to achieve performance improvements?
- What will be without the complex codegen? Is it possible that JIT helps to avoid it?

I think we finally defined the 3 options to compare:
- plain Java with a few different GCs optimized for a big heap size
- Arrow vectors
- fixed-size Drill vectors

So as the first step, I want to develop a POC and compare these options. Since you're much more experienced with Drill than I am, could you please help me to define the minimal set of operators to compare?

Thanks,
Igor