Hi Micah,

Thanks a lot for your comments. Your solution sounds reasonable. We opened a
JIRA yesterday (ARROW-5290 <https://issues.apache.org/jira/browse/ARROW-5290>)
for the static boolean wrapper you mentioned (this was also suggested by
Jacques).
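For readers following along, the "static boolean wrapper" pattern referenced here, modeled on the existing BoundsChecking class, could be sketched roughly as below. This is a hedged sketch: the class name NullChecking and the property name arrow.enable_null_check_for_get are illustrative assumptions, not the final API.

```java
// Hypothetical sketch of the proposed wrapper (ARROW-5290), modeled on
// org.apache.arrow.memory.BoundsChecking. Class and property names here
// are illustrative assumptions, not the final API.
final class NullChecking {

  // Read once in a static initializer, so the JIT can treat the flag as
  // a compile-time constant and eliminate the dead branch entirely.
  static final boolean NULL_CHECKING_ENABLED;

  static {
    String value = System.getProperty("arrow.enable_null_check_for_get", "true");
    NULL_CHECKING_ENABLED = Boolean.parseBoolean(value);
  }

  private NullChecking() {
  }
}
```

With something like this in place, a get method can guard its null check behind NULL_CHECKING_ENABLED, and users who need the extra performance can disable it at JVM startup (e.g. -Darrow.enable_null_check_for_get=false).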
Hopefully, it will solve the problem.

Best,
Liya Fan

On Fri, May 10, 2019 at 12:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Liya Fan and Wes,
> TL;DR: I think we can either close
> https://issues.apache.org/jira/browse/ARROW-1833 or repurpose it with the
> slightly different implementation proposed on one of the open pull
> requests [1].
>
> The new approach will add another final static boolean wrapper class (like
> the memory bounds checking [2]) to turn off the validation against the
> null bitmap ArrowBuf (via isSet) inside get* for use cases that require
> the extra performance.
>
> This means no additional methods should need to be introduced. Based on
> the numbers above, it seems like this will have a non-trivial positive
> impact at the microbenchmark level.
>
> It is up to the caller to decide if they need to call isSet beforehand
> (and they can avoid it if the null count is zero), but that is orthogonal.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4258
> [2] https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
>
> On Thu, May 9, 2019 at 7:22 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi Wes,
> >
> > I think the problem for ArrowBuf can be resolved by
> > disabling BoundsChecking.BOUNDS_CHECKING_ENABLED.
> > For example, this is the code of getInt:
> >
> > public int getInt(int index) {
> >   chk(index, INT_SIZE);
> >   return PlatformDependent.getInt(addr(index));
> > }
> >
> > The chk method performs the bounds check, which can be turned off via
> > BoundsChecking.BOUNDS_CHECKING_ENABLED.
> > I do not see null checking in ArrowBuf. Maybe you are talking about
> > another buffer class?
> >
> > Best,
> > Liya Fan
> >
> > On Thu, May 9, 2019 at 9:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > It has also been previously suggested to add a get* method that
> > > returns the value in the ArrowBuf without null checking, like
> > > getDirty.
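To make the flag-guarded bounds check concrete, here is a minimal, self-contained sketch. The SketchBuf class, its int[] backing array, and the property handling are illustrative assumptions; this is not the real ArrowBuf, whose chk delegates to BoundsChecking and reads raw memory via PlatformDependent.

```java
// Illustrative stand-in for ArrowBuf.getInt: the bounds check in chk is
// guarded by a static final flag, so when bounds checking is disabled the
// branch is constant-folded away by the JIT. Not the real Arrow class.
final class SketchBuf {

  // Mirrors the spirit of BoundsChecking.BOUNDS_CHECKING_ENABLED: checking
  // stays on unless the unsafe-access property is explicitly set to true.
  static final boolean BOUNDS_CHECKING_ENABLED =
      !"true".equals(System.getProperty("arrow.enable_unsafe_memory_access"));

  private final int[] data; // stand-in for the underlying memory

  SketchBuf(int[] data) {
    this.data = data;
  }

  private void chk(int index) {
    if (BOUNDS_CHECKING_ENABLED && (index < 0 || index >= data.length)) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
  }

  int getInt(int index) {
    chk(index);         // compiled away when bounds checking is disabled
    return data[index]; // stand-in for PlatformDependent.getInt(addr(index))
  }
}
```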
> > > See https://issues.apache.org/jira/browse/ARROW-1833
> > >
> > > Any thoughts about that?
> > >
> > > On Thu, May 9, 2019 at 4:54 AM niki.lj <niki...@aliyun.com.invalid> wrote:
> > > >
> > > > +1 on this proposal.
> > > >
> > > > ------------------------------------------------------------------
> > > > From: Fan Liya <liya.fa...@gmail.com>
> > > > Date: Thursday, May 9, 2019, 16:33
> > > > To: dev <dev@arrow.apache.org>
> > > > Subject: Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
> > > >
> > > > Hi all,
> > > >
> > > > Our previous results on micro-benchmarks show that the original
> > > > Arrow API is 30% slower than the unsafe API.
> > > > After profiling, we found that the performance overhead comes from
> > > > the null checking in the get method. For example, the get method of
> > > > Float8Vector looks like this:
> > > >
> > > > public double get(int index) throws IllegalStateException {
> > > >   if (isSet(index) == 0) {
> > > >     throw new IllegalStateException("Value at index is null");
> > > >   }
> > > >   return valueBuffer.getDouble(index * TYPE_WIDTH);
> > > > }
> > > >
> > > > It first makes sure the value is not null, and then retrieves the
> > > > value.
> > > >
> > > > In some cases, the first check is redundant, because the application
> > > > code usually does the check before calling the get method. For such
> > > > cases, the first check can be skipped.
> > > > Therefore, @Jacques Nadeau suggests adding another flag to
> > > > enable/disable such a check. I think this is a good suggestion,
> > > > because it solves the performance problem without introducing a new
> > > > set of vector classes. What do you think?
> > > >
> > > > I have opened a JIRA for that (ARROW-5290
> > > > <https://issues.apache.org/jira/browse/ARROW-5290>). Please give
> > > > your valuable comments.
> > > > Thanks a lot for your attention and valuable comments.
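A minimal, self-contained sketch of the flag-guarded get that ARROW-5290 proposes might look like the following. The SketchFloat8Vector class, its plain-array backing "buffers", and the property name are illustrative assumptions, not Arrow's real implementation.

```java
// Illustrative sketch (not the real Float8Vector): the null check in get
// is guarded by a static final flag, so applications that already call
// isSet before get can disable the redundant check globally.
final class SketchFloat8Vector {

  // Stand-in for the proposed wrapper flag; the property name is an
  // illustrative assumption.
  static final boolean NULL_CHECKING_ENABLED = Boolean.parseBoolean(
      System.getProperty("arrow.enable_null_check_for_get", "true"));

  private final double[] values;    // stand-in for valueBuffer
  private final boolean[] validity; // stand-in for the validity bitmap

  SketchFloat8Vector(double[] values, boolean[] validity) {
    this.values = values;
    this.validity = validity;
  }

  int isSet(int index) {
    return validity[index] ? 1 : 0;
  }

  double get(int index) {
    // When the flag is off, the JIT folds the static final constant and
    // removes this branch, matching the unsafe API's fast path.
    if (NULL_CHECKING_ENABLED && isSet(index) == 0) {
      throw new IllegalStateException("Value at index is null");
    }
    return values[index];
  }
}
```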
> > > > Special thanks to @Jacques Nadeau for all the suggestions and
> > > > helpful comments.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Wed, May 8, 2019 at 1:05 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > > Hi Jacques,
> > > > >
> > > > > Thanks a lot for your comments.
> > > > >
> > > > > I have evaluated the assembly code of the original Arrow API, as
> > > > > well as the unsafe API in our PR
> > > > > <https://github.com/apache/arrow/pull/4212>.
> > > > > Generally, the assembly code generated by the JIT for both APIs is
> > > > > of high quality, and in most cases the assembly code is almost the
> > > > > same.
> > > > >
> > > > > However, some checks can be further removed. The following figures
> > > > > give an example (the figures are too big to be attached, so I have
> > > > > attached them in a JIRA comment. Please see comment
> > > > > <https://issues.apache.org/jira/browse/ARROW-5200?focusedCommentId=16835303&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16835303>.
> > > > > Sorry for the inconvenience):
> > > > >
> > > > > The first figure shows the code of the original Arrow API, while
> > > > > the second shows the code for the unsafe API.
> > > > > It can be observed that for the unsafe API, the amounts of source,
> > > > > byte, and assembly code are all smaller, so it can be expected
> > > > > that the performance of the unsafe API is better.
> > > > >
> > > > > Concerning this particular example for the Float8Vector, I think
> > > > > it is reasonable to further remove the check in the get method:
> > > > > Before we call the get method, we must check if the value is null,
> > > > > so the check in the get method is redundant.
> > > > > And this is a typical scenario of using the Arrow API (check and
> > > > > then get), at least for our scenario (please take a glimpse of
> > > > > our benchmark in PR <https://github.com/apache/arrow/pull/4198>).
> > > > >
> > > > > Concerning the other problem, about the real algorithm in our
> > > > > scenario, I want to make two points:
> > > > >
> > > > > 1. SQL engines are performance critical, so 30% is a large number
> > > > > for us. Over the past year, it took our team several months just
> > > > > to improve the performance of our runtime engine by around 15%.
> > > > >
> > > > > 2. The performance of the engine heavily depends on the
> > > > > performance of Arrow. Most SQL engines are memory-intensive, so
> > > > > the performance of the get/set methods is the key. To get a
> > > > > flavor of the algorithms in our engine, please refer to PR
> > > > > <https://github.com/apache/arrow/pull/4198>. That is the core
> > > > > algorithm of our operator, which is executed many times during
> > > > > the processing of a SQL query. You can find that the computation
> > > > > is relatively simple, and most method calls are memory accesses.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > On Mon, May 6, 2019 at 5:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > >> I am still asking the same question: can you please analyze the
> > > > >> assembly the JIT is producing and look to identify why the
> > > > >> disabled bounds checking is at 30%, and what types of things we
> > > > >> can do to address it. For example, we have talked before about a
> > > > >> bytecode transformer that simply removes the bounds checking when
> > > > >> loading Arrow if you want that behavior. If necessary, that may
> > > > >> be a big win from a code maintenance standpoint over having
> > > > >> duplicate interfaces.
> > > > >>
> > > > >> The static block seems like a non-problem. You could simply load
> > > > >> another class that sets that system property before loading any
> > > > >> Arrow code. If you're proposing a code change to solve your
> > > > >> problem today, this seems just as feasible.
> > > > >>
> > > > >> The other question: in a real algorithm, how much does that 30%
> > > > >> matter? Your benchmarks are entirely about this one call, whereas
> > > > >> real workloads are impacted by many things, and the time spent
> > > > >> writing/reading vectors is minuscule versus other things.
> > > > >>
> > > > >> On Mon, May 6, 2019 at 1:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi Jacques,
> > > > >> >
> > > > >> > Thank you so much for your kind reminder.
> > > > >> >
> > > > >> > To come up with some performance data, I have set up an
> > > > >> > environment and run some micro-benchmarks.
> > > > >> > The server runs Linux, has 64 cores, and has 256 GB of memory.
> > > > >> > The benchmarks are simple iterations over some double vectors
> > > > >> > (the source file is attached):
> > > > >> >
> > > > >> > @Benchmark
> > > > >> > @BenchmarkMode(Mode.AverageTime)
> > > > >> > @OutputTimeUnit(TimeUnit.MICROSECONDS)
> > > > >> > public void testSafe() {
> > > > >> >   safeSum = 0;
> > > > >> >   for (int i = 0; i < VECTOR_LENGTH; i++) {
> > > > >> >     safeVector.set(i, i + 10.0);
> > > > >> >     safeSum += safeVector.get(i);
> > > > >> >   }
> > > > >> > }
> > > > >> >
> > > > >> > @Benchmark
> > > > >> > @BenchmarkMode(Mode.AverageTime)
> > > > >> > @OutputTimeUnit(TimeUnit.MICROSECONDS)
> > > > >> > public void testUnSafe() {
> > > > >> >   unSafeSum = 0;
> > > > >> >   for (int i = 0; i < VECTOR_LENGTH; i++) {
> > > > >> >     unsafeVector.set(i, i + 10.0);
> > > > >> >     unSafeSum += unsafeVector.get(i);
> > > > >> >   }
> > > > >> > }
> > > > >> >
> > > > >> > The safe vector in the testSafe benchmark is from the original
> > > > >> > Arrow implementation, whereas the unsafe vector in the
> > > > >> > testUnSafe benchmark is based on our initial implementation in
> > > > >> > PR <https://github.com/apache/arrow/pull/4212> (this is not
> > > > >> > the final version; however, we believe much of the overhead
> > > > >> > has been removed).
> > > > >> > The evaluation is based on the JMH framework (thanks to the
> > > > >> > suggestion from Jacques Nadeau). The benchmarks are run so
> > > > >> > many times by the framework that the effects of JIT are well
> > > > >> > considered.
> > > > >> >
> > > > >> > In the first experiment, we use the default configuration
> > > > >> > (boundary checking enabled), and the original Arrow vector is
> > > > >> > about 4 times slower:
> > > > >> >
> > > > >> > Benchmark                       Mode  Cnt   Score   Error  Units
> > > > >> > VectorAPIBenchmarks.testSafe    avgt    5  11.546 ± 0.012  us/op
> > > > >> > VectorAPIBenchmarks.testUnSafe  avgt    5   2.822 ± 0.006  us/op
> > > > >> >
> > > > >> > In the second experiment, we disable the boundary checking by
> > > > >> > JVM options:
> > > > >> >
> > > > >> > -Ddrill.enable_unsafe_memory_access=true
> > > > >> > -Darrow.enable_unsafe_memory_access=true
> > > > >> >
> > > > >> > This time, the original Arrow vector is about 30% slower:
> > > > >> >
> > > > >> > Benchmark                       Mode  Cnt  Score   Error  Units
> > > > >> > VectorAPIBenchmarks.testSafe    avgt    5  4.069 ± 0.004  us/op
> > > > >> > VectorAPIBenchmarks.testUnSafe  avgt    5  2.819 ± 0.005  us/op
> > > > >> >
> > > > >> > This is a significant improvement, about 2.84x faster compared
> > > > >> > to when bounds checking is enabled.
> > > > >> > However, in our scenario, we would still choose to bypass the
> > > > >> > Arrow APIs without hesitation, because such memory accesses
> > > > >> > are such frequent operations that a 30% performance
> > > > >> > degradation will easily cause us to lose our edge.
> > > > >> >
> > > > >> > The results can be attributed to the following factors:
> > > > >> > 1. Although the checks have been disabled, we still need to
> > > > >> > read the flag and check it repeatedly in the Arrow APIs, which
> > > > >> > accumulates to a large performance overhead.
> > > > >> > 2. There is too much code in the call stacks, compared with
> > > > >> > the unsafe API. This will lead to a less efficient i-cache,
> > > > >> > even if the JIT can avoid the cost of stack frames by
> > > > >> > in-lining most method code.
> > > > >> >
> > > > >> > Another, maybe separate, problem is that the flag
> > > > >> > BoundsChecking#BOUNDS_CHECKING_ENABLED is final and
> > > > >> > initialized in a static block. That means the only reliable
> > > > >> > way to override it is to override system properties in the JVM
> > > > >> > command line. However, for some scenarios, we do not have
> > > > >> > access to the command line (e.g. running Flink on YARN). I
> > > > >> > think this deserves a separate issue.
> > > > >> >
> > > > >> > Best,
> > > > >> > Liya Fan
> > > > >> >
> > > > >> > On Mon, May 6, 2019 at 1:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >> >
> > > > >> >> > Maybe I need to take a closer look at how the other SQL
> > > > >> >> > engines are using Arrow, to see if they are also bypassing
> > > > >> >> > Arrow APIs.
> > > > >> >> > I agree that a random user should be able to protect
> > > > >> >> > themselves, and this is the utmost priority.
> > > > >> >> >
> > > > >> >> > According to my experience in Flink, JIT cannot optimize
> > > > >> >> > away the checks, and removing the checks addresses the
> > > > >> >> > issue. I want to illustrate this from two points:
> > > > >> >> >
> > > > >> >> > 1. Theoretical viewpoint: JIT makes optimizations without
> > > > >> >> > changing the semantics of the code, so it can never remove
> > > > >> >> > the checks without changing code semantics. To make it
> > > > >> >> > simple, if the JIT has witnessed the engine successfully
> > > > >> >> > process 1,000,000 records, how can it be sure that the
> > > > >> >> > 1,000,001st record will be successful?
> > > > >> >> >
> > > > >> >> > 2. Practical viewpoint: we have evaluated our SQL engine on
> > > > >> >> > the TPC-H 1 TB data set. This is really a large number of
> > > > >> >> > records.
> > > > >> >> > So the JIT must have done all it could to improve the
> > > > >> >> > code. According to the performance results, however, it
> > > > >> >> > could not eliminate the impact caused by the checks.
> > > > >> >>
> > > > >> >> I don't think you're following my point. There are two
> > > > >> >> different points it seems like you want to discuss. Let's
> > > > >> >> evaluate each separately:
> > > > >> >>
> > > > >> >> 1) Bounds checking for safety
> > > > >> >> 2) Supposed inefficiency of the call hierarchy
> > > > >> >>
> > > > >> >> For #1, we provide a system-level property that can disable
> > > > >> >> these. The JVM should successfully optimize away this
> > > > >> >> operation if that flag is set. Please look at the JIT output
> > > > >> >> to confirm whether this is true.
> > > > >> >>
> > > > >> >> For #2: we designed things to collapse, so the call hierarchy
> > > > >> >> shouldn't be a problem. Please look at the JIT output to
> > > > >> >> confirm.
> > > > >> >>
> > > > >> >> Please come with data around #1 and #2 to make an argument
> > > > >> >> for a set of changes.
> > > > >> >>
> > > > >> >> thanks
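As a footnote to the static-block discussion in the thread: the workaround Jacques suggested (loading a class that sets the system property before any Arrow code) could be sketched as below. The ArrowBootstrap class is a hypothetical helper, and whether it works depends entirely on it running before BoundsChecking's static initializer.

```java
// Hypothetical helper illustrating the suggested workaround: set the
// system properties from code before any Arrow class is loaded, so that
// BoundsChecking's static initializer reads the overridden values.
// Once that static block has run, later changes have no effect.
final class ArrowBootstrap {

  static void disableBoundsChecking() {
    // Property names are the ones quoted in the benchmark JVM options.
    System.setProperty("drill.enable_unsafe_memory_access", "true");
    System.setProperty("arrow.enable_unsafe_memory_access", "true");
  }

  private ArrowBootstrap() {
  }
}
```

For example, ArrowBootstrap.disableBoundsChecking() could be the first statement of main(), before any vector or buffer class is touched, which is useful when the JVM command line cannot be changed (e.g. Flink on YARN).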