[ https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-9965:
----------------------------------
Labels: pull-request-available  (was: )

[Java] Buffer capacity calculations are slow for fixed-width vectors
--------------------------------------------------------------------

Key: ARROW-9965
URL: https://issues.apache.org/jira/browse/ARROW-9965
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Josiah
Priority: Minor
Labels: pull-request-available
Fix For: 2.0.0
Attachments: after_patch_profile_prof_perfasm_unsafe_true, before_patch_profile_prof_perfasm_unsafe_true
Time Spent: 10m
Remaining Estimate: 0h

It turns out that setSafe performs a very expensive integer division when computing buffer capacity: it divides by the field's type width, which is not a compile-time constant. Although the width is typically a power of 2 for alignment reasons, the JIT does not compile the division down to a bit shift.

The division happens here:
https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189

Forcing a bit-shift operation produces a large speedup in benchmarks. With bounds checks turned off (they affect another portion of set), microbenchmarks indicate that throughput when setting vector elements via setSafe increases by ~174% (almost 3 times faster). With bounds checks on, the gain drops to a 73% increase (Amdahl's law).

We use setSafe right now in a hot loop to populate Arrow vectors in an internal data-intensive service (for now), although in the future we would prefer a more specialized vector-append interface that skips the other indirection and bit-manipulation instructions without directly manipulating the exposed (native) memory.

Here is the detailed analysis. Tests were run on a machine with an Intel 8700k, compiled with JDK 8, and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
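The effect can be sketched outside Arrow (all names below are illustrative, not Arrow code): dividing a non-negative byte capacity by a power-of-two element width is equivalent to right-shifting by the width's log2, but the JIT can only perform this strength reduction on its own when the divisor is a compile-time constant.

```java
// Illustrative sketch, not Arrow code: division by a runtime power-of-two
// divisor equals a right shift by its log2 (for non-negative dividends).
public class CapacityShiftSketch {
    public static void main(String[] args) {
        long capacityInBytes = 4096L;  // hypothetical value-buffer capacity
        int typeWidth = 4;             // e.g. 4 bytes per value for an int vector

        // log2 of a power of two = index of its single set bit.
        int shift = Integer.numberOfTrailingZeros(typeWidth);

        long viaDivision = capacityInBytes / typeWidth;  // may compile to idiv
        long viaShift = capacityInBytes >> shift;        // compiles to a shift

        System.out.println(viaDivision == viaShift);  // true
    }
}
```

The equivalence only holds for non-negative dividends (division truncates toward zero, arithmetic shift toward negative infinity), which is fine here since capacities are never negative.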
{code}
Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op

Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15   7.964 ± 0.030  us/op
IntBenchmarks.setWithValueHolder  avgt   15   9.145 ± 0.031  us/op
IntBenchmarks.setWithWriter       avgt   15   8.029 ± 0.051  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15   3.490 ± 0.175  us/op
IntBenchmarks.setWithValueHolder  avgt   15   3.806 ± 0.015  us/op
IntBenchmarks.setWithWriter       avgt   15   5.490 ± 0.304  us/op
{code}

I determined this by running the built-in Arrow JMH benchmarks on an 8700k. I left the CPU frequency scaling in the default state. The numbers seemed off for setting a value, so I reran the benchmarks with the `prof=perfasm` option in JMH, which emitted annotated assembly for detected hotspots.
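Before digging into the assembly, a quick sanity check: the percentages quoted earlier follow from the setIntDirectly rows in the tables above (the arithmetic here is mine, not part of the JMH output, and rounds the 73% figure to 74%).

```java
// Recomputing the quoted speedups from the setIntDirectly benchmark rows.
public class SpeedupCheck {
    public static void main(String[] args) {
        // unsafe_memory_access=true: 9.563 us/op before the patch, 3.490 after.
        double unsafeGain = 9.563 / 3.490 - 1.0;    // ~1.74, i.e. the ~174% increase
        // unsafe_memory_access=false (bounds checks on): 13.853 before, 7.964 after.
        double checkedGain = 13.853 / 7.964 - 1.0;  // ~0.74, roughly the quoted 73%
        System.out.printf("%.0f%% %.0f%%%n", unsafeGain * 100, checkedGain * 100);
    }
}
```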
Here is the relevant section of the perfasm output:

{code}
  0.06%  │  ││  0x00007f5a7f7beb6f: mov     0x30(%r12,%rsi,8),%rax  ; implicit exception: dispatches to 0x00007f5a7f7bef28
         │  ││    ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
         │  ││    ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
         │  ││    ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
         │  ││    ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
         │  ││    ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
  0.14%  │  ││  0x00007f5a7f7beb74: movsxd  0x10(%r12,%rdi,8),%rcx
         │  ││    ;*i2l {reexecute=0 rethrow=0 return_oop=0}
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
         │  ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
         │  ││    ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
         │  ││    ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
         │  ││    ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
*snip*
  1.43%  ││ │ ││  0x00007f5a7f7beb9b: idivq  %rcx,%rax
         ││ │ ││    ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
         ││ │ ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
         ││ │ ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
         ││ │ ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
         ││ │ ││    ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
         ││ │ ││    ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
         ││ │ ││    ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
 68.16%  ││ ↘ ││  0x00007f5a7f7beb9e: cmp    $0x7fffffff,%rax
         ││   ││  0x00007f5a7f7beba5: jnle   0x7f5a7f7bec8c
         ││   ││    ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
         ││   ││    ; - java.lang.Math::min@3 (line 1552)
         ││   ││    ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
         ││   ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
         ││   ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
         ││   ││    ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
         ││   ││    ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
         ││   ││    ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
         ││   ││    ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
{code}

The hot instruction is misattributed, probably due to event instruction skid. But integer division is known to be expensive to implement. We can verify this with Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv gives high numbers in the tables provided, as expected.
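The fix hinges on a power-of-two test plus an exact log2, computed once at construction time. The idiom can be sketched standalone (this is a paraphrase of the patch below, not the patch itself):

```java
// Standalone sketch of the patch's helper: log2 of x when x is a positive
// power of two, else -1 (meaning "fall back to division").
public class Log2Sketch {
    static int log2ExactOrNeg1(int x) {
        // x & (x - 1) clears the lowest set bit; it is 0 only for powers of two.
        boolean isPowerOfTwo = x > 0 && (x & (x - 1)) == 0;
        // For a power of two, (Integer.SIZE - 1) - numberOfLeadingZeros(x) and
        // numberOfTrailingZeros(x) both give the index of the single set bit.
        return isPowerOfTwo ? Integer.numberOfTrailingZeros(x) : -1;
    }

    public static void main(String[] args) {
        System.out.println(log2ExactOrNeg1(4));   // 2: shift for a 4-byte int vector
        System.out.println(log2ExactOrNeg1(16));  // 4: e.g. a 16-byte decimal width
        System.out.println(log2ExactOrNeg1(12));  // -1: not a power of two
    }
}
```

Computing the shift once in the constructor keeps the hot path down to a single compare and shift, with the division only as a fallback for non-power-of-two widths.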
After noting all of this, we can apply the following patch, which produces the speedup described above:

{code}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index ee47f6dd8..0c9a57bf9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
 public abstract class BaseFixedWidthVector extends BaseValueVector
     implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
   private final int typeWidth;
+  private final int typeWidthAsExactBitShiftOrNegOne;

   protected int lastValueCapacity;
@@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
     super(allocator);
     this.typeWidth = typeWidth;
+    this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
     this.field = field;
     valueCount = 0;
     allocationMonitor = 0;
@@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   }

   private int getValueBufferValueCapacity() {
-    return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    if (typeWidthAsExactBitShiftOrNegOne == -1) {
+      // Slow path - integral division is very very expensive, and code here is part of
+      // setSafe's hot loop
+      // The JVM did not optimize integral division into bit shifts
+      return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    }
+    return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
   }

   private int getValidityBufferValueCapacity() {
@@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
     return visitor.visit(this, value);
   }

+  private int log2ExactOrNeg1(int x) {
+    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
+    if (!isPowerOfTwo) {
+      return -1;
+    }
+    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
+  }
+
 }
{code}

Attached is the generated assembly as printed by JMH, before and after. I renamed some variables for clarity after generating the results, but the logic is unchanged.

I also did a quick test with JDK 8 - this was where I originally ran the benchmarks. The idiv instruction was present there too.

An initial version of this patch cached the value - this produces about the same speedup.

Are people fine with this approach?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)