> > 1. How much slower is the current Arrow API, compared to directly
> > accessing off-heap memory?
>
> According to my (intuitive) experience in vectorizing Flink, the current
> API is much slower, at least one or two orders of magnitude slower.
> I am sorry I do not have the exact number, but the conclusion can be
> expected to hold: Parth's experience on Drill also confirms it.
> In fact, we are working on this. ARROW-5209 is about introducing
> performance benchmarks, and once that is done, the numbers will be clear.
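To make that comparison concrete, here is a rough sketch of the kind of JMH micro-benchmark I would expect such a measurement to use. This is only an illustrative sketch, not the ARROW-5209 code; the package and class names are made up, and the raw path assumes the value buffer is not reallocated after its address is taken:

package example;                                        // hypothetical package

import java.lang.reflect.Field;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)
public class Float8GetBenchmark {                       // hypothetical class

  private static final int VALUE_COUNT = 1024;
  private static final sun.misc.Unsafe UNSAFE = loadUnsafe();

  private RootAllocator allocator;
  private Float8Vector vector;
  private long dataAddress;

  @Setup
  public void setup() {
    allocator = new RootAllocator(Long.MAX_VALUE);
    vector = new Float8Vector("doubles", allocator);
    vector.allocateNew(VALUE_COUNT);
    for (int i = 0; i < VALUE_COUNT; i++) {
      vector.set(i, i * 1.0);
    }
    vector.setValueCount(VALUE_COUNT);
    // Address of the value buffer; only valid as long as the buffer is not reallocated.
    dataAddress = vector.getDataBuffer().memoryAddress();
  }

  @TearDown
  public void tearDown() throws Exception {
    vector.close();
    allocator.close();
  }

  @Benchmark
  public double apiGet() {
    // Goes through the public API, including its null/bounds checks.
    double sum = 0;
    for (int i = 0; i < VALUE_COUNT; i++) {
      sum += vector.get(i);
    }
    return sum;
  }

  @Benchmark
  public double rawGet() {
    // Reads straight from off-heap memory; no checks, an out-of-range index can crash the JVM.
    double sum = 0;
    for (int i = 0; i < VALUE_COUNT; i++) {
      sum += UNSAFE.getDouble(dataAddress + i * Float8Vector.TYPE_WIDTH);
    }
    return sum;
  }

  private static sun.misc.Unsafe loadUnsafe() {
    try {
      Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (sun.misc.Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new AssertionError(e);
    }
  }
}

Note that the raw variant gives up every bounds and null check that the API variant performs.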
Are you comparing to a situation where you can crash the JVM versus one
where you cannot? Let's make sure we're comparing apples to apples.

> > 2. Why are the current Arrow APIs so slow?
>
> I think the main reason is too many function calls. I believe each function
> call is highly optimized and only carries out simple work; however, the
> number of calls is large.
> Our online doc gives a simple example: a single call to the Float8Vector.get
> method (a fairly fundamental API) involves nearly 30 method calls. That is
> just too much overhead, especially in performance-critical scenarios such
> as SQL engines.

Are they? Who is asking that? I haven't heard that feedback at all, and we
use the Arrow APIs extensively in Dremio and compete very well with other
SQL engines. The APIs were designed with the perspective that they need to
protect themselves in the context of the JVM so that a random user doesn't
hurt themselves. It sounds like maybe you don't agree with that.

It would be good for you to outline the 30 methods you see being called in
the Float8Vector.get method. In general, I think we should focus on the
compiled code once it has been optimized, not on the number of methods. Have
you looked at the assembly the JIT outputs for this method? The get method
should collapse to a very small number of instructions, and if it doesn't,
we should address that. Have you done that analysis? Has disabling the
bounds checking addressed the issue for you?

> > 3. Can we live without Arrow, and just directly access the off-heap
> > memory (e.g. via the UNSAFE instance)?
>
> I guess the answer is absolutely, yes.
> Parth is doing this (bypassing the Arrow API) with Drill, and this is
> exactly what we are doing with Flink. My point is that providing
> light-weight APIs would make it easier to use Arrow. Without such APIs,
> Parth may need to provide a library of Arrow wrappers in Drill, we will
> need to provide a library of Arrow wrappers in Flink, and so on. That's
> redundant work, and it may reduce the popularity of Arrow.

How are you going to come up with a set of APIs that protect the user or
unroll the checks? Or are you just arguing that the user should not be
protected?
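For reference, here is a minimal sketch of the two access paths under
discussion: the checked Float8Vector.get call, and a read taken straight off
the value buffer's address with sun.misc.Unsafe. The class below is
hypothetical and only meant to illustrate the trade-off; if I remember
correctly, bounds checking is controlled by the arrow.enable_unsafe_memory_access
system property, but please verify the exact flag before relying on it:

package example;                                        // hypothetical package

import java.lang.reflect.Field;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;

public final class DirectAccessSketch {                 // hypothetical class

  private static final sun.misc.Unsafe UNSAFE = loadUnsafe();

  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         Float8Vector vector = new Float8Vector("doubles", allocator)) {
      vector.allocateNew(4);
      for (int i = 0; i < 4; i++) {
        vector.set(i, i + 0.5);
      }
      vector.setValueCount(4);

      // Path 1: the supported API; bounds/null checks protect the JVM.
      // (Bounds checking is governed by a system property, which I believe is
      // -Darrow.enable_unsafe_memory_access=true to turn the checks off.)
      double viaApi = vector.get(2);

      // Path 2: bypass the API and read straight off the value buffer's
      // address. Nothing prevents an out-of-range index from crashing the JVM.
      long address = vector.getDataBuffer().memoryAddress();
      double viaUnsafe = UNSAFE.getDouble(address + 2L * Float8Vector.TYPE_WIDTH);

      System.out.println(viaApi + " == " + viaUnsafe);
    }
  }

  private static sun.misc.Unsafe loadUnsafe() {
    try {
      Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (sun.misc.Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new AssertionError(e);
    }
  }
}

The design choice in the current APIs is that the first path is the default,
so an out-of-range index fails with an exception instead of taking down the
JVM.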