[ 
https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9965:
----------------------------------
    Labels: pull-request-available  (was: )

> [Java] Buffer capacity calculations are slow for fixed-width vectors
> --------------------------------------------------------------------
>
>                 Key: ARROW-9965
>                 URL: https://issues.apache.org/jira/browse/ARROW-9965
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Josiah
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: after_patch_profile_prof_perfasm_unsafe_true, 
> before_patch_profile_prof_perfasm_unsafe_true
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It turns out that setSafe performs a very expensive integer division when 
> computing buffer capacity; specifically, it divides by the type width, which 
> isn't a compile-time constant. Although the width is typically a power of 2 
> for alignment reasons, the JIT can't strength-reduce that division into a 
> bit shift.
> The division happens here: 
> https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
>  
> Forcing a bit shift results in a large speedup in benchmarks. With bounds 
> checks turned off (they affect another portion of set), microbenchmarks 
> indicate that the throughput of setting vector elements via setSafe improves 
> by ~174% (almost 3x faster). With bounds checks on, the improvement drops to 
> ~73%, since the division is then a smaller fraction of the total work 
> (Amdahl's law).
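> As a minimal standalone sketch of the idea (the class and method names below 
> are illustrative, not the actual Arrow code): precompute the shift once when 
> the type width is an exact power of two, and only fall back to division 
> otherwise.
> {code}
> /** Illustrative sketch: replace a hot division by a precomputed shift. */
> final class CapacityMath {
>   private final int typeWidth;
>   // log2(typeWidth) if typeWidth is an exact power of two, otherwise -1.
>   private final int shiftOrNegOne;
> 
>   CapacityMath(int typeWidth) {
>     this.typeWidth = typeWidth;
>     boolean powerOfTwo = typeWidth > 0 && (typeWidth & (typeWidth - 1)) == 0;
>     this.shiftOrNegOne =
>         powerOfTwo ? (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(typeWidth) : -1;
>   }
> 
>   /** Number of elements that fit into capacityBytes; shifts on the fast path. */
>   long elementCapacity(long capacityBytes) {
>     return shiftOrNegOne >= 0 ? capacityBytes >> shiftOrNegOne : capacityBytes / typeWidth;
>   }
> }
> {code}
> This mirrors what the patch at the end of this issue does inside 
> BaseFixedWidthVector.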
> We currently call setSafe in a hot loop to populate Arrow vectors in an 
> internal data-intensive service. In the future we would prefer a more 
> specialized vector append interface that skips the remaining indirection and 
> bit-manipulation instructions without having to touch the exposed (native) 
> memory directly; a hypothetical sketch follows.
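> Purely as a hypothetical sketch of what such an append interface could look 
> like (nothing below exists in Arrow; the names are invented for 
> illustration): capacity would be reserved once per batch, so the per-element 
> call performs neither a capacity check nor a division.
> {code}
> /** Hypothetical batch-append API; not part of Arrow. */
> interface IntAppender {
>   /** Ensure room for at least n more values; called once, outside the hot loop. */
>   void reserve(int n);
> 
>   /** Append one value with no capacity check in the hot path. */
>   void appendUnchecked(int value);
> 
>   /** Publish the final value count (and validity bits) back to the vector. */
>   void finish();
> }
> {code}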
> Here is the detailed analysis:
> Tests were run on a machine with an Intel 8700K. The benchmarks were compiled 
> with JDK 8 and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
> {code}
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
> 
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
> IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
> IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op
> 
> Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied
> 
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=false
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> *snip*
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15   7.964 ± 0.030  us/op
> IntBenchmarks.setWithValueHolder  avgt   15   9.145 ± 0.031  us/op
> IntBenchmarks.setWithWriter       avgt   15   8.029 ± 0.051  us/op
> 
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied
> 
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
> IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
> IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op
> 
> Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
> 
> # JMH version: 1.21
> # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
> # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
> # VM options: -Darrow.enable_unsafe_memory_access=true
> # Warmup: 5 iterations, 10 s each
> # Measurement: 5 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt   15   3.490 ± 0.175  us/op
> IntBenchmarks.setWithValueHolder  avgt   15   3.806 ± 0.015  us/op
> IntBenchmarks.setWithWriter       avgt   15   5.490 ± 0.304  us/op
> {code}
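> For convenience, here are the before/after scores from the runs above side by 
> side, with the implied speedups (ratios rounded to one decimal):
> {code}
> Benchmark                         unsafe=false             unsafe=true
> IntBenchmarks.setIntDirectly      13.853 -> 7.964 (~1.7x)  9.563 -> 3.490 (~2.7x)
> IntBenchmarks.setWithValueHolder  15.045 -> 9.145 (~1.6x)  9.266 -> 3.806 (~2.4x)
> IntBenchmarks.setWithWriter       21.621 -> 8.029 (~2.7x)  18.806 -> 5.490 (~3.4x)
> (all scores in us/op, patch NOT applied -> patch applied)
> {code}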
> I determined this by running the built-in Arrow JMH benchmarks on the 8700K, 
> with CPU frequency scaling left in its default state. The numbers for setting 
> a value seemed slower than expected, so I reran the benchmarks with JMH's 
> `prof=perfasm` option, which emits annotated assembly for detected hotspots. 
> Here is the relevant section (a sketch of reproducing this run through the 
> JMH API follows the listing):
> {code}
> 0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
> │ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
> │ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
> │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
> │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
> │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
> 0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
> │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
> │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
> │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
> │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
> *snip*
> 1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
> ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
> ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
> ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
> ││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
> ││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
> ││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
> 68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
> ││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
> ││ ││ ; - java.lang.Math::min@3 (line 1552)
> ││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
> ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
> ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
> ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
> ││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
> ││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
> ││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
> {code}
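> For reference, this is roughly how the perfasm run can be reproduced through 
> the JMH Java API instead of the command line (a sketch assuming the Arrow 
> benchmark classes are on the classpath and that Linux perf plus the hsdis 
> disassembler are installed):
> {code}
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.RunnerException;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> 
> public class PerfAsmRepro {
>   public static void main(String[] args) throws RunnerException {
>     Options opts = new OptionsBuilder()
>         .include("org.apache.arrow.vector.IntBenchmarks")  // benchmark class used above
>         .addProfiler("perfasm")                            // annotated-assembly profiler
>         .jvmArgsAppend("-Darrow.enable_unsafe_memory_access=true")
>         .build();
>     new Runner(opts).run();
>   }
> }
> {code}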
> The hot instruction is misattributed, probably due to perf event skid, but 
> integer division is known to be expensive in hardware. We can verify this 
> with Agner Fog's instruction tables: 
> https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv 
> shows high latency and throughput costs there, as expected.
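> To see the cost difference in isolation, one could run a tiny JMH comparison 
> like the sketch below (this class is made up for illustration and is not part 
> of Arrow's benchmark suite); the divisor is kept in a field so the JIT sees 
> it the same way it sees typeWidth:
> {code}
> import java.util.concurrent.TimeUnit;
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
> 
> @State(Scope.Thread)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class DivVsShiftBench {
>   // Fields, not constants, so the JIT cannot fold or strength-reduce them,
>   // mirroring the per-instance typeWidth field in BaseFixedWidthVector.
>   long capacityBytes = 1L << 20;
>   long typeWidth = 4;
>   int shift = 2;
> 
>   @Benchmark
>   public long divideByField() {
>     return capacityBytes / typeWidth;  // compiles to an idiv
>   }
> 
>   @Benchmark
>   public long shiftByField() {
>     return capacityBytes >> shift;     // compiles to a cheap shift
>   }
> }
> {code}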
> After noting all of this, we can apply the following patch, which produces 
> the speedup shown above:
> {code}
> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
> index ee47f6dd8..0c9a57bf9 100644
> --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
> +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
> @@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
>  public abstract class BaseFixedWidthVector extends BaseValueVector
>      implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
>    private final int typeWidth;
> +  private final int typeWidthAsExactBitShiftOrNegOne;
>  
>    protected int lastValueCapacity;
> @@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
>    public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
>      super(allocator);
>      this.typeWidth = typeWidth;
> +    this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
>      this.field = field;
>      valueCount = 0;
>      allocationMonitor = 0;
> @@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
>    }
>  
>    private int getValueBufferValueCapacity() {
> -    return capAtMaxInt(valueBuffer.capacity() / typeWidth);
> +    if (typeWidthAsExactBitShiftOrNegOne == -1) {
> +      // Slow path - integral division is very very expensive, and code here is part of
> +      // setSafe's hot loop
> +      // The JVM did not optimize integral division into bit shifts
> +      return capAtMaxInt(valueBuffer.capacity() / typeWidth);
> +    }
> +    return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
>    }
>  
>    private int getValidityBufferValueCapacity() {
> @@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
>      return visitor.visit(this, value);
>    }
>  
> +  private int log2ExactOrNeg1(int x) {
> +    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
> +    if (!isPowerOfTwo) {
> +      return -1;
> +    }
> +    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
> +  }
> +
>  }
> {code}
> Attached is the generated assembly as printed by JMH, before and after. I 
> renamed some variables for clarity after generating the results, but the 
> logic is unchanged.
> I also did a quick test with JDK 8 - this was where I originally ran the 
> benchmarks. The idiv instruction was present there too.
> An initial version of this patch cached the value instead; that produces 
> about the same speedup (a sketch of what such caching could look like is 
> below).
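> Assuming "cached the value" means storing the computed element capacity and 
> refreshing it only when the buffer is (re)allocated, that alternative would 
> look roughly like this standalone sketch (not the actual earlier patch):
> {code}
> /** Illustrative caching alternative: the division happens off the hot path. */
> final class CachedCapacity {
>   private final int typeWidth;
>   private long cachedValueCapacity;  // buffer capacity in bytes / typeWidth, precomputed
> 
>   CachedCapacity(int typeWidth) {
>     this.typeWidth = typeWidth;
>   }
> 
>   /** Call whenever the underlying buffer is allocated or reallocated. */
>   void onBufferResized(long newCapacityBytes) {
>     cachedValueCapacity = newCapacityBytes / typeWidth;
>   }
> 
>   /** Hot path: no division at all. */
>   long valueCapacity() {
>     return cachedValueCapacity;
>   }
> }
> {code}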
> Are people fine with this approach?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
