josiahyan commented on pull request #8214:
URL: https://github.com/apache/arrow/pull/8214#issuecomment-696450195
Here are the results of my testing. I'm not really that familiar with Arrow,
and some of the code is sloppy, so please check that what I'm doing matches up
with your expectations! I didn't check whether the benchmarks were really
measuring something realistic either (although I have no specific reason to
doubt that).
All benchmarks were compiled and run on the repo-provided
java-14-openjdk-amd64 on Ubuntu 20.04. I've disabled HT, CPU freq scaling, but
did not pin CPU cores.
Representative CLI invocation:
```
java -Darrow.enable_unsafe_memory_access=true -jar
../arrow/java/performance/target/benchmarks.jar
'.*IntBenchmarks.setIntDirectly' -w 2 -r 2 -f 2`
# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 2 s each
# Measurement: 5 iterations, 2 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
```
* Option 1: the original IntBenchmarks
* Option 2: save the value/validity buffer capacities in a wrapper around
IntVector, and then manually resize, calling .set
* Option 3: save the overall value capacity in
BaseFixedWidthVector.getValueCapacity, and force recomputation based on
valueBuffer.capacity() and validityBuffer.capacity() (what I understand of
@liyafan82 's suggestion that avoids invalidation issues around the underlying
buffers)
* Option 4: Call the despecialized putLong directly (The code that was
linked in Iceberg, with more code to handle validity buffer)
master w/o this patch: `IntBenchmarks.setIntDirectly avgt 10 11.605 ±
0.433 us/op`
Option 1: `IntBenchmarks.setIntDirectly avgt 10 4.670 ± 0.095 us/op`
Option 2: `IntBenchmarks.setIntDirectlyWithSpecializedWriterCachedCapacities
avgt 10 2.228 ± 0.073 us/op`
Option 3: `IntBenchmarks.setIntDirectly avgt 10 3.446 ± 0.022 us/op`
Option 4: `IntBenchmarks.setIntDirectlyDirectlyWithPutInt avgt 10 1.887
± 0.009 us/op`
Code for Option 1:
This patch, with the master benchmark code, as extracted:
```
+@State(Scope.Benchmark)
+public class IntBenchmarks {
+
+ private static final int VECTOR_LENGTH = 1024;
+
+ private static final int ALLOCATOR_CAPACITY = 1024 * 1024;
+
+ private BufferAllocator allocator;
+
+ private IntVector vector;
+
+ @Setup
+ public void prepare() {
+ allocator = new RootAllocator(ALLOCATOR_CAPACITY);
+ vector = new IntVector("vector", allocator);
+ vector.allocateNew(VECTOR_LENGTH);
+ vector.setValueCount(VECTOR_LENGTH);
+ }
...
+ @Benchmark
+ @BenchmarkMode(Mode.AverageTime)
+ @OutputTimeUnit(TimeUnit.MICROSECONDS)
+ public void setIntDirectly() {
+ for (int i = 0; i < VECTOR_LENGTH; i++) {
+ vector.setSafe(i, i % 3 == 0 ? 0 : 1, i);
+ }
+ }
```
Code for Option 2:
```
diff --git
a/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java
b/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java
new file mode 100644
index 000000000..c4201dea8
--- /dev/null
+++
b/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java
@@ -0,0 +1,27 @@
+package org.apache.arrow.vector;
+
+public class CapacityCachingIntVector {
+ private final IntVector backingVector;
+ private long backingVectorCachedCapacity;
+
+ public CapacityCachingIntVector(IntVector backing) {
+ this.backingVector = backing;
+ synchronizeCapacity();
+ }
+
+ public void setSafe(int index, int isSet, int value) {
+ if (index >= backingVectorCachedCapacity) {
+ // unlikely
+ this.backingVector.setSafe(index, isSet, value); // force a
resize
+ synchronizeCapacity();
+ return;
+ }
+ // likely
+ this.backingVector.set(index, isSet, value);
+ return;
+ }
+
+ private void synchronizeCapacity() {
+ backingVectorCachedCapacity = backingVector.getValueCapacity();
+ }
+}
```
Code for Option 3:
```
@@ -182,11 +187,19 @@ public abstract class BaseFixedWidthVector extends
BaseValueVector
*/
@Override
public int getValueCapacity() {
- return Math.min(getValueBufferValueCapacity(),
getValidityBufferValueCapacity());
+ final long valueBufferCapacity = valueBuffer.capacity();
+ final long validityBufferCapacity = validityBuffer.capacity();
+ if (underlyingValueBufferCapacityCached != valueBufferCapacity ||
underlyingValidityBufferCapacityCached != validityBufferCapacity) {
+ // trap modification
+ underlyingValueBufferCapacityCached = valueBufferCapacity;
+ underlyingValidityBufferCapacityCached = validityBufferCapacity;
+ valueCapacityCached = Math.min(getValueBufferValueCapacity(),
getValidityBufferValueCapacity());
+ }
+ return valueCapacityCached;
}
private int getValueBufferValueCapacity() {
- return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+ return capAtMaxInt(valueBuffer.capacity() / typeWidth);
}
```
Code for Option 4:
```
+ @Benchmark
+ @BenchmarkMode(Mode.AverageTime)
+ @OutputTimeUnit(TimeUnit.MICROSECONDS)
+ public void setIntDirectlyDirectlyWithPutInt() {
+ ArrowBuf dataBuffer = vector.getDataBuffer();
+ ArrowBuf validityBuffer = vector.getValidityBuffer();
+ long validityBatch = 0;
+ int i;
+ for (i = 0; i < VECTOR_LENGTH; i++) {
+ if (i % 64 == 0 && i != 0) {
+ // time to flush to validity buffer
+ // this implementation does not consider endianness, it only
emulates flushing an incrementally built bit vector
+ validityBuffer.setLong((i - 1)>> 3, validityBatch);
+ }
+ dataBuffer.setInt(IntVector.TYPE_WIDTH * i, i);
+ validityBatch = (validityBatch << 1) | (i % 3 == 0 ? 0 : 1);
+ }
+ if (i % 64 > 0) {
+ // time to flush to validity buffer
+ // this implementation does not consider endianness or OOB access, it
only emulates flushing an incrementally built bit vector
+ validityBuffer.setLong(i >> 3, validityBatch);
+ }
+ }
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]