Hi Micah,

Thanks a lot for your comments. Your solution sounds reasonable. We opened a
JIRA yesterday (ARROW-5290 <https://issues.apache.org/jira/browse/ARROW-5290>)
for the static boolean wrapper you mentioned (this was also suggested by
Jacques).
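For readers following along, the "static boolean wrapper" pattern referenced here, modeled on the existing BoundsChecking class, could be sketched roughly as below. This is a hedged sketch: the class name NullChecking and the property name arrow.enable_null_check_for_get are illustrative assumptions, not the final API.

```java
// Hypothetical sketch of the proposed wrapper (ARROW-5290), modeled on
// org.apache.arrow.memory.BoundsChecking. Class and property names here
// are illustrative assumptions, not the final API.
final class NullChecking {

  // Read once in a static initializer, so the JIT can treat the flag as
  // a compile-time constant and eliminate the dead branch entirely.
  static final boolean NULL_CHECKING_ENABLED;

  static {
    String value = System.getProperty("arrow.enable_null_check_for_get", "true");
    NULL_CHECKING_ENABLED = Boolean.parseBoolean(value);
  }

  private NullChecking() {
  }
}
```

With something like this in place, a get method can guard its null check behind NULL_CHECKING_ENABLED, and users who need the extra performance can disable it at JVM startup (e.g. -Darrow.enable_null_check_for_get=false).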
Hopefully, it will solve the problem.

Best,
Liya Fan

On Fri, May 10, 2019 at 12:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Liya Fan and Wes,
> TL;DR: I think we can either close
> https://issues.apache.org/jira/browse/ARROW-1833 or repurpose it with the
> slightly different implementation proposed on one of the open pull
> requests [1].
>
> The new approach will add another final static boolean wrapper class (like
> the memory bounds checking [2]) to turn off the validation against the
> null bitmap ArrowBuf (via isSet) inside get* for use cases that require
> the extra performance.
>
> This means no additional methods should need to be introduced. Based on
> the numbers above, it seems like this will have a non-trivial positive
> impact at the microbenchmark level.
>
> It is up to the caller to decide if they need to call isSet beforehand
> (and they can avoid it if the null count is zero), but that is orthogonal.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4258
> [2] https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
>
> On Thu, May 9, 2019 at 7:22 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi Wes,
> >
> > I think the problem for ArrowBuf can be resolved by
> > disabling BoundsChecking.BOUNDS_CHECKING_ENABLED.
> > For example, this is the code of getInt:
> >
> > public int getInt(int index) {
> >   chk(index, INT_SIZE);
> >   return PlatformDependent.getInt(addr(index));
> > }
> >
> > The chk method performs the bounds check, which can be turned off via
> > BoundsChecking.BOUNDS_CHECKING_ENABLED.
> > I do not see null checking in ArrowBuf. Maybe you are talking about
> > another buffer class?
> >
> > Best,
> > Liya Fan
> >
> > On Thu, May 9, 2019 at 9:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > It has also been previously suggested to add a get* method that
> > > returns the value in the ArrowBuf without null checking, like
> > > getDirty.
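To make the flag-guarded bounds check concrete, here is a minimal, self-contained sketch. The SketchBuf class, its int[] backing array, and the property handling are illustrative assumptions; this is not the real ArrowBuf, whose chk delegates to BoundsChecking and reads raw memory via PlatformDependent.

```java
// Illustrative stand-in for ArrowBuf.getInt: the bounds check in chk is
// guarded by a static final flag, so when bounds checking is disabled the
// branch is constant-folded away by the JIT. Not the real Arrow class.
final class SketchBuf {

  // Mirrors the spirit of BoundsChecking.BOUNDS_CHECKING_ENABLED: checking
  // stays on unless the unsafe-access property is explicitly set to true.
  static final boolean BOUNDS_CHECKING_ENABLED =
      !"true".equals(System.getProperty("arrow.enable_unsafe_memory_access"));

  private final int[] data; // stand-in for the underlying memory

  SketchBuf(int[] data) {
    this.data = data;
  }

  private void chk(int index) {
    if (BOUNDS_CHECKING_ENABLED && (index < 0 || index >= data.length)) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
  }

  int getInt(int index) {
    chk(index);         // compiled away when bounds checking is disabled
    return data[index]; // stand-in for PlatformDependent.getInt(addr(index))
  }
}
```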
> > > See https://issues.apache.org/jira/browse/ARROW-1833
> > >
> > > Any thoughts about that?
> > >
> > > On Thu, May 9, 2019 at 4:54 AM niki.lj <niki...@aliyun.com.invalid> wrote:
> > > >
> > > > +1 on this proposal.
> > > >
> > > > ------------------------------------------------------------------
> > > > From: Fan Liya <liya.fa...@gmail.com>
> > > > Date: Thursday, May 9, 2019, 16:33
> > > > To: dev <dev@arrow.apache.org>
> > > > Subject: Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
> > > >
> > > > Hi all,
> > > >
> > > > Our previous results on micro-benchmarks show that the original
> > > > Arrow API is 30% slower than the unsafe API.
> > > > After profiling, we found that the performance overhead comes from
> > > > the null checking in the get method. For example, the get method of
> > > > Float8Vector looks like this:
> > > >
> > > > public double get(int index) throws IllegalStateException {
> > > >   if (isSet(index) == 0) {
> > > >     throw new IllegalStateException("Value at index is null");
> > > >   }
> > > >   return valueBuffer.getDouble(index * TYPE_WIDTH);
> > > > }
> > > >
> > > > It first makes sure the value is not null, and then retrieves the
> > > > value.
> > > >
> > > > In some cases, the first check is redundant, because the application
> > > > code usually does the check before calling the get method. For such
> > > > cases, the first check can be skipped.
> > > > Therefore, @Jacques Nadeau suggests adding another flag to
> > > > enable/disable such a check. I think this is a good suggestion,
> > > > because it solves the performance problem without introducing a new
> > > > set of vector classes. What do you think?
> > > >
> > > > I have opened a JIRA for that (ARROW-5290
> > > > <https://issues.apache.org/jira/browse/ARROW-5290>). Please give
> > > > your valuable comments.
> > > > Thanks a lot for your attention and valuable comments.
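A minimal, self-contained sketch of the flag-guarded get that ARROW-5290 proposes might look like the following. The SketchFloat8Vector class, its plain-array backing "buffers", and the property name are illustrative assumptions, not Arrow's real implementation.

```java
// Illustrative sketch (not the real Float8Vector): the null check in get
// is guarded by a static final flag, so applications that already call
// isSet before get can disable the redundant check globally.
final class SketchFloat8Vector {

  // Stand-in for the proposed wrapper flag; the property name is an
  // illustrative assumption.
  static final boolean NULL_CHECKING_ENABLED = Boolean.parseBoolean(
      System.getProperty("arrow.enable_null_check_for_get", "true"));

  private final double[] values;    // stand-in for valueBuffer
  private final boolean[] validity; // stand-in for the validity bitmap

  SketchFloat8Vector(double[] values, boolean[] validity) {
    this.values = values;
    this.validity = validity;
  }

  int isSet(int index) {
    return validity[index] ? 1 : 0;
  }

  double get(int index) {
    // When the flag is off, the JIT folds the static final constant and
    // removes this branch, matching the unsafe API's fast path.
    if (NULL_CHECKING_ENABLED && isSet(index) == 0) {
      throw new IllegalStateException("Value at index is null");
    }
    return values[index];
  }
}
```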
> > > > Special thanks to @Jacques Nadeau for all the suggestions and
> > > > helpful comments.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Wed, May 8, 2019 at 1:05 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > > Hi Jacques,
> > > > >
> > > > > Thanks a lot for your comments.
> > > > >
> > > > > I have evaluated the assembly code of the original Arrow API, as
> > > > > well as the unsafe API in our PR
> > > > > <https://github.com/apache/arrow/pull/4212>.
> > > > > Generally, the assembly code generated by the JIT for both APIs is
> > > > > of high quality, and in most cases the assembly code is almost the
> > > > > same.
> > > > >
> > > > > However, some checks can be further removed. The following figures
> > > > > give an example (the figures are too big to be attached, so I have
> > > > > attached them in a JIRA comment. Please see comment
> > > > > <https://issues.apache.org/jira/browse/ARROW-5200?focusedCommentId=16835303&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16835303>.
> > > > > Sorry for the inconvenience):
> > > > >
> > > > > The first figure shows the code of the original Arrow API, while
> > > > > the second shows the code for the unsafe API.
> > > > > It can be observed that for the unsafe API, the amounts of source,
> > > > > byte, and assembly code are all smaller, so it can be expected
> > > > > that the performance of the unsafe API is better.
> > > > >
> > > > > Concerning this particular example for the Float8Vector, I think
> > > > > it is reasonable to further remove the check in the get method:
> > > > > Before we call the get method, we must check if the value is null,
> > > > > so the check in the get method is redundant.
> > > > > And this is a typical scenario of using the Arrow API (check and
> > > > > then get), at least for our scenario (please take a glimpse of
> > > > > our benchmark in PR <https://github.com/apache/arrow/pull/4198>).
> > > > >
> > > > > Concerning the other problem, about the real algorithm in our
> > > > > scenario, I want to make two points:
> > > > >
> > > > > 1. SQL engines are performance critical, so 30% is a large number
> > > > > for us. Over the past year, it took our team several months just
> > > > > to improve the performance of our runtime engine by around 15%.
> > > > >
> > > > > 2. The performance of the engine heavily depends on the
> > > > > performance of Arrow. Most SQL engines are memory-intensive, so
> > > > > the performance of the get/set methods is the key. To get a
> > > > > flavor of the algorithms in our engine, please refer to PR
> > > > > <https://github.com/apache/arrow/pull/4198>. That is the core
> > > > > algorithm of our operator, which is executed many times during
> > > > > the processing of a SQL query. You can find that the computation
> > > > > is relatively simple, and most method calls are memory accesses.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > On Mon, May 6, 2019 at 5:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > >> I am still asking the same question: can you please analyze the
> > > > >> assembly the JIT is producing and look to identify why the
> > > > >> disabled bounds checking is at 30%, and what types of things we
> > > > >> can do to address it. For example, we have talked before about a
> > > > >> bytecode transformer that simply removes the bounds checking when
> > > > >> loading Arrow if you want that behavior. If necessary, that may
> > > > >> be a big win from a code maintenance standpoint over having
> > > > >> duplicate interfaces.
> > > > >>
> > > > >> The static block seems like a non-problem. You could simply load
> > > > >> another class that sets that system property before loading any
> > > > >> Arrow code. If you're proposing a code change to solve your
> > > > >> problem today, this seems just as feasible.
> > > > >>
> > > > >> The other question: in a real algorithm, how much does that 30%
> > > > >> matter? Your benchmarks are entirely about this one call, whereas
> > > > >> real workloads are impacted by many things, and the time spent
> > > > >> writing/reading vectors is minuscule versus other things.
> > > > >>
> > > > >> On Mon, May 6, 2019 at 1:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi Jacques,
> > > > >> >
> > > > >> > Thank you so much for your kind reminder.
> > > > >> >
> > > > >> > To come up with some performance data, I have set up an
> > > > >> > environment and run some micro-benchmarks.
> > > > >> > The server runs Linux, has 64 cores, and has 256 GB of memory.
> > > > >> > The benchmarks are simple iterations over some double vectors
> > > > >> > (the source file is attached):
> > > > >> >
> > > > >> > @Benchmark
> > > > >> > @BenchmarkMode(Mode.AverageTime)
> > > > >> > @OutputTimeUnit(TimeUnit.MICROSECONDS)
> > > > >> > public void testSafe() {
> > > > >> >   safeSum = 0;
> > > > >> >   for (int i = 0; i < VECTOR_LENGTH; i++) {
> > > > >> >     safeVector.set(i, i + 10.0);
> > > > >> >     safeSum += safeVector.get(i);
> > > > >> >   }
> > > > >> > }
> > > > >> >
> > > > >> > @Benchmark
> > > > >> > @BenchmarkMode(Mode.AverageTime)
> > > > >> > @OutputTimeUnit(TimeUnit.MICROSECONDS)
> > > > >> > public void testUnSafe() {
> > > > >> >   unSafeSum = 0;
> > > > >> >   for (int i = 0; i < VECTOR_LENGTH; i++) {
> > > > >> >     unsafeVector.set(i, i + 10.0);
> > > > >> >     unSafeSum += unsafeVector.get(i);
> > > > >> >   }
> > > > >> > }
> > > > >> >
> > > > >> > The safe vector in the testSafe benchmark is from the original
> > > > >> > Arrow implementation, whereas the unsafe vector in the
> > > > >> > testUnSafe benchmark is based on our initial implementation in
> > > > >> > PR <https://github.com/apache/arrow/pull/4212> (this is not
> > > > >> > the final version; however, we believe much of the overhead
> > > > >> > has been removed).
> > > > >> > The evaluation is based on the JMH framework (thanks to the
> > > > >> > suggestion from Jacques Nadeau). The benchmarks are run so
> > > > >> > many times by the framework that the effects of JIT are well
> > > > >> > considered.
> > > > >> >
> > > > >> > In the first experiment, we use the default configuration
> > > > >> > (boundary checking enabled), and the original Arrow vector is
> > > > >> > about 4 times slower:
> > > > >> >
> > > > >> > Benchmark                       Mode  Cnt   Score   Error  Units
> > > > >> > VectorAPIBenchmarks.testSafe    avgt    5  11.546 ± 0.012  us/op
> > > > >> > VectorAPIBenchmarks.testUnSafe  avgt    5   2.822 ± 0.006  us/op
> > > > >> >
> > > > >> > In the second experiment, we disable the boundary checking by
> > > > >> > JVM options:
> > > > >> >
> > > > >> > -Ddrill.enable_unsafe_memory_access=true
> > > > >> > -Darrow.enable_unsafe_memory_access=true
> > > > >> >
> > > > >> > This time, the original Arrow vector is about 30% slower:
> > > > >> >
> > > > >> > Benchmark                       Mode  Cnt  Score   Error  Units
> > > > >> > VectorAPIBenchmarks.testSafe    avgt    5  4.069 ± 0.004  us/op
> > > > >> > VectorAPIBenchmarks.testUnSafe  avgt    5  2.819 ± 0.005  us/op
> > > > >> >
> > > > >> > This is a significant improvement, about 2.84x faster compared
> > > > >> > to when bounds checking is enabled.
> > > > >> > However, in our scenario, we would still choose to bypass the
> > > > >> > Arrow APIs without hesitation, because such memory accesses
> > > > >> > are such frequent operations that a 30% performance
> > > > >> > degradation will easily cause us to lose our edge.
> > > > >> >
> > > > >> > The results can be attributed to the following factors:
> > > > >> > 1. Although the checks have been disabled, we still need to
> > > > >> > read the flag and check it repeatedly in the Arrow APIs, which
> > > > >> > accumulates to a large performance overhead.
> > > > >> > 2. There is too much code in the call stacks, compared with
> > > > >> > the unsafe API. This will lead to a less efficient i-cache,
> > > > >> > even if the JIT can avoid the cost of stack frames by
> > > > >> > in-lining most method code.
> > > > >> >
> > > > >> > Another, maybe separate, problem is that the flag
> > > > >> > BoundsChecking#BOUNDS_CHECKING_ENABLED is final and
> > > > >> > initialized in a static block. That means the only reliable
> > > > >> > way to override it is to override system properties in the JVM
> > > > >> > command line. However, for some scenarios, we do not have
> > > > >> > access to the command line (e.g. running Flink on YARN). I
> > > > >> > think this deserves a separate issue.
> > > > >> >
> > > > >> > Best,
> > > > >> > Liya Fan
> > > > >> >
> > > > >> > On Mon, May 6, 2019 at 1:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >> >
> > > > >> >> > Maybe I need to take a closer look at how the other SQL
> > > > >> >> > engines are using Arrow, to see if they are also bypassing
> > > > >> >> > Arrow APIs.
> > > > >> >> > I agree that a random user should be able to protect
> > > > >> >> > themselves, and this is the utmost priority.
> > > > >> >> >
> > > > >> >> > According to my experience in Flink, JIT cannot optimize
> > > > >> >> > away the checks, and removing the checks addresses the
> > > > >> >> > issue. I want to illustrate this from two points:
> > > > >> >> >
> > > > >> >> > 1. Theoretical viewpoint: JIT makes optimizations without
> > > > >> >> > changing the semantics of the code, so it can never remove
> > > > >> >> > the checks without changing code semantics. To make it
> > > > >> >> > simple, if the JIT has witnessed the engine successfully
> > > > >> >> > process 1,000,000 records, how can it be sure that the
> > > > >> >> > 1,000,001st record will be successful?
> > > > >> >> >
> > > > >> >> > 2. Practical viewpoint: we have evaluated our SQL engine on
> > > > >> >> > the TPC-H 1 TB data set. This is really a large number of
> > > > >> >> > records.
> > > > >> >> > So the JIT must have done all it could to improve the
> > > > >> >> > code. According to the performance results, however, it
> > > > >> >> > could not eliminate the impact caused by the checks.
> > > > >> >>
> > > > >> >> I don't think you're following my point. There are two
> > > > >> >> different points it seems like you want to discuss. Let's
> > > > >> >> evaluate each separately:
> > > > >> >>
> > > > >> >> 1) Bounds checking for safety
> > > > >> >> 2) Supposed inefficiency of the call hierarchy
> > > > >> >>
> > > > >> >> For #1, we provide a system-level property that can disable
> > > > >> >> these. The JVM should successfully optimize away this
> > > > >> >> operation if that flag is set. Please look at the JIT output
> > > > >> >> to confirm whether this is true.
> > > > >> >>
> > > > >> >> For #2: we designed things to collapse, so the call hierarchy
> > > > >> >> shouldn't be a problem. Please look at the JIT output to
> > > > >> >> confirm.
> > > > >> >>
> > > > >> >> Please come with data around #1 and #2 to make an argument
> > > > >> >> for a set of changes.
> > > > >> >>
> > > > >> >> thanks
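As a footnote to the static-block discussion in the thread: the workaround Jacques suggested (loading a class that sets the system property before any Arrow code) could be sketched as below. The ArrowBootstrap class is a hypothetical helper, and whether it works depends entirely on it running before BoundsChecking's static initializer.

```java
// Hypothetical helper illustrating the suggested workaround: set the
// system properties from code before any Arrow class is loaded, so that
// BoundsChecking's static initializer reads the overridden values.
// Once that static block has run, later changes have no effect.
final class ArrowBootstrap {

  static void disableBoundsChecking() {
    // Property names are the ones quoted in the benchmark JVM options.
    System.setProperty("drill.enable_unsafe_memory_access", "true");
    System.setProperty("arrow.enable_unsafe_memory_access", "true");
  }

  private ArrowBootstrap() {
  }
}
```

For example, ArrowBootstrap.disableBoundsChecking() could be the first statement of main(), before any vector or buffer class is touched, which is useful when the JVM command line cannot be changed (e.g. Flink on YARN).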