[
https://issues.apache.org/jira/browse/ARROW-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josiah updated ARROW-9965:
--------------------------
Attachment: after_patch_profile_prof_perfasm_unsafe_true
before_patch_profile_prof_perfasm_unsafe_true
Description:
It turns out that setSafe performs a very expensive integer division when
computing the buffer capacity: it divides by the type width, which is not a
compile-time constant. Although the width is typically a power of two for
alignment reasons, the JIT cannot strength-reduce the division to a bit shift
because the divisor is a variable.
This is done here:
https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
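To make the constraint concrete, here is a minimal standalone sketch (class and
helper names are mine, not Arrow code) showing that replacing the division with
a shift is only valid when the type width is a power of two:

```java
public class ShiftEquivalence {
  // A division can be replaced by a right shift only when the divisor is a power of two.
  public static boolean isPowerOfTwo(int x) {
    return x > 0 && (x & (x - 1)) == 0;
  }

  public static void main(String[] args) {
    long capacity = 4096;
    for (int typeWidth : new int[] {1, 2, 4, 8}) {
      // Shift amount is log2(typeWidth) for power-of-two widths.
      int shift = Integer.numberOfTrailingZeros(typeWidth);
      if (capacity / typeWidth != capacity >> shift) {
        throw new AssertionError("mismatch for width " + typeWidth);
      }
    }
    System.out.println("shift matches division for all power-of-two widths");
  }
}
```

The JIT performs this strength reduction automatically only for divisors it can
prove constant, which is why the field-based divisor in Arrow misses it.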
Forcing a bit-shift operation yields a large speedup in benchmarks. With bounds
checks turned off (they affect another portion of set), microbenchmarks show
that setting vector elements via setSafe gets ~174% faster (almost 3x). With
bounds checks on, the improvement drops to ~73%, as Amdahl's law predicts: the
division is then a smaller fraction of the total work.
We currently use setSafe in a hot loop to populate Arrow vectors in an internal
data-intensive service (for now), although in the future we would prefer a more
specialized vector-append interface that skips the remaining indirection and
bit-manipulation instructions without manipulating the exposed (native) memory
directly.
Here is the detailed analysis:
Tests were run on a machine with an Intel Core i7-8700K. The code was compiled
with JDK 8 and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
{code}
Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 13.853 ± 0.058 us/op
IntBenchmarks.setWithValueHolder avgt 15 15.045 ± 0.040 us/op
IntBenchmarks.setWithWriter avgt 15 21.621 ± 0.197 us/op
Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 7.964 ± 0.030 us/op
IntBenchmarks.setWithValueHolder avgt 15 9.145 ± 0.031 us/op
IntBenchmarks.setWithWriter avgt 15 8.029 ± 0.051 us/op
Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 9.563 ± 0.335 us/op
IntBenchmarks.setWithValueHolder avgt 15 9.266 ± 0.064 us/op
IntBenchmarks.setWithWriter avgt 15 18.806 ± 0.154 us/op
Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark Mode Cnt Score Error Units
IntBenchmarks.setIntDirectly avgt 15 3.490 ± 0.175 us/op
IntBenchmarks.setWithValueHolder avgt 15 3.806 ± 0.015 us/op
IntBenchmarks.setWithWriter avgt 15 5.490 ± 0.304 us/op
{code}
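As a sanity check, the headline percentages can be recomputed from the
setIntDirectly rows above (plain arithmetic; the class name is illustrative):

```java
public class SpeedupCheck {
  public static void main(String[] args) {
    // setIntDirectly with unsafe=true: 9.563 us/op before the patch, 3.490 after.
    double unsafeSpeedup = 9.563 / 3.490;  // ~2.74x, i.e. ~174% faster
    // setIntDirectly with unsafe=false (bounds checks on): 13.853 -> 7.964 us/op.
    double safeSpeedup = 13.853 / 7.964;   // ~1.74x, i.e. ~73-74% faster
    System.out.printf("unsafe: %.2fx, safe: %.2fx%n", unsafeSpeedup, safeSpeedup);
  }
}
```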
I found this by running Arrow's built-in JMH benchmarks on the 8700K, with CPU
frequency scaling left in its default state. The numbers for setting a value
seemed off, so I reran the benchmarks with JMH's `-prof perfasm` option, which
emits annotated assembly for detected hotspots. Here is the relevant section:
{code}
0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
│ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
*snip*
1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - java.lang.Math::min@3 (line 1552)
││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
{code}
The hot instruction is misattributed, most likely due to sampling skid, but the
idivq two instructions earlier is the real culprit: integer division is well
known to be expensive in hardware. Agner Fog's instruction tables
( https://www.agner.org/optimize/instruction_tables.pdf ) confirm this;
searching for idiv shows high latency and reciprocal-throughput numbers, as
expected.
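The cost difference can also be observed with a crude single-shot timing sketch
(class and method names are mine; this is no substitute for the JMH numbers
above, which remain authoritative):

```java
public class DivVsShiftTiming {
  // Divisor is a runtime variable, so the JIT likely keeps an idiv here.
  public static long divSum(long[] data, int divisor) {
    long s = 0;
    for (long d : data) s += d / divisor;
    return s;
  }

  public static long shiftSum(long[] data, int shift) {
    long s = 0;
    for (long d : data) s += d >> shift;
    return s;
  }

  public static void main(String[] args) {
    long[] data = new long[1 << 20];
    for (int i = 0; i < data.length; i++) data[i] = i;
    // Warm up so the JIT compiles both loops before timing.
    for (int i = 0; i < 50; i++) { divSum(data, 4); shiftSum(data, 2); }
    long t0 = System.nanoTime();
    long a = divSum(data, 4);
    long t1 = System.nanoTime();
    long b = shiftSum(data, 2);
    long t2 = System.nanoTime();
    if (a != b) throw new AssertionError("division and shift disagree");
    System.out.printf("div: %d us, shift: %d us%n", (t1 - t0) / 1000, (t2 - t1) / 1000);
  }
}
```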
With all of this established, the following patch produces the speedup shown
above:
{code}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index ee47f6dd8..0c9a57bf9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
 public abstract class BaseFixedWidthVector extends BaseValueVector
     implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
   private final int typeWidth;
+  private final int typeWidthAsExactBitShiftOrNegOne;
   protected int lastValueCapacity;
@@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
     super(allocator);
     this.typeWidth = typeWidth;
+    this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
     this.field = field;
     valueCount = 0;
     allocationMonitor = 0;
@@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   }
   private int getValueBufferValueCapacity() {
-    return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    if (typeWidthAsExactBitShiftOrNegOne == -1) {
+      // Slow path: integer division is very expensive, and this code is part of
+      // setSafe's hot loop. The JVM did not optimize the division into a bit shift.
+      return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    }
+    return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
   }
   private int getValidityBufferValueCapacity() {
@@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
     return visitor.visit(this, value);
   }
+  private int log2ExactOrNeg1(int x) {
+    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
+    if (!isPowerOfTwo) {
+      return -1;
+    }
+    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
+  }
+
 }
{code}
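For reference, a standalone copy of the patch's log2ExactOrNeg1 helper
(reproduced outside the class so it can be exercised directly; the wrapper
class name is mine) behaves as follows:

```java
public class Log2Check {
  // Same logic as the patch: exact log2 for positive powers of two, else -1.
  public static int log2ExactOrNeg1(int x) {
    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
    if (!isPowerOfTwo) {
      return -1;
    }
    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
  }

  public static void main(String[] args) {
    // Typical fixed-width sizes (1/2/4/8/16 bytes) all take the fast shift path.
    for (int w : new int[] {1, 2, 4, 8, 16}) {
      int shift = log2ExactOrNeg1(w);
      if (shift < 0 || (1 << shift) != w) {
        throw new AssertionError("unexpected shift for width " + w);
      }
    }
    // A non-power-of-two width returns -1 and falls back to division.
    if (log2ExactOrNeg1(12) != -1) throw new AssertionError();
    System.out.println("log2ExactOrNeg1 behaves as expected");
  }
}
```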
Attached is the generated assembly as printed by JMH, before and after the
patch. I renamed some variables for clarity after generating the results, but
the logic is unchanged.
I also ran a quick test with JDK 8 (where I originally ran the benchmarks);
the idiv instruction was present there too.
An initial version of this patch cached the value instead; it produced about
the same speedup.
Are people fine with this approach?
> [Java] Buffer capacity calculations are slow for fixed-width vectors
> --------------------------------------------------------------------
>
> Key: ARROW-9965
> URL: https://issues.apache.org/jira/browse/ARROW-9965
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Josiah
> Priority: Minor
> Attachments: after_patch_profile_prof_perfasm_unsafe_true,
> before_patch_profile_prof_perfasm_unsafe_true
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)