[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-700028726 Thanks everyone for all the comments and suggestions! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-698085225 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-698863342 @liyafan82 thanks for the detailed review! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-698085225 I've updated the PR to cache the capacities, and recompute it whenever it is changed; please take a look again. Here are the benchmarks that seemed to have shown some change (>5% either way) in an initial quick run. Any slowdown seems to be minor or within the error bounds. Manipulation functions (set*) were improved as expected, and so was a dictionary encoder benchmark. Before (152f8b08a70d403eae78ebdc89814be248008a53) ``` Benchmark (numThreads) Mode Cnt Score Error Units o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 1 avgt 15 1656.614 ± 2.424 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 2 avgt 15 2123.750 ± 72.633 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 5 avgt 15942.920 ± 4.585 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 10 avgt 15553.499 ± 17.866 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 20 avgt 15572.557 ± 16.074 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 50 avgt 15373.616 ± 42.984 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 100 avgt 15358.959 ± 21.536 us/op o.a.a.memory.util.ArrowBufPointerBenchmarks.compareBenchmark N/A avgt 15 53.190 ± 3.987 ns/op o.a.a.vector.BitVectorHelperBenchmarks.loadValidityBufferAllOne N/A avgt 15329.615 ± 14.539 ns/op o.a.a.vector.DecimalVectorBenchmarks.setBigEndianArrowBufBenchmark N/A avgt 15 15.229 ± 0.174 us/op o.a.a.vector.IntBenchmarks.setIntDirectly N/A avgt 15 11.708 ± 0.397 us/op o.a.a.vector.IntBenchmarks.setWithValueHolder N/A avgt 15 11.217 ± 0.092 us/op o.a.a.vector.IntBenchmarks.setWithWriter N/A avgt 15 22.418 ± 0.441 us/op o.a.a.vector.dictionary.DictionaryEncoderBenchmarks.testEncode N/A avgt 15 37998.387 ± 361.695 ns/op ``` After ``` Benchmark (numThreads) Mode Cnt Score Error Units o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 1 avgt 15 1656.956 ± 2.462 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 2 avgt 15 2170.146 ± 72.324 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 5 avgt 15 1004.663 ± 97.292 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 10 avgt 15589.840 ± 58.153 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 20 avgt 15534.463 ± 52.171 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 50 avgt 15372.424 ± 41.733 us/op o.a.a.algorithm.search.ParallelSearcherBenchmarks.searchBenchmark 100 avgt 15358.727 ± 21.764 us/op o.a.a.memory.util.ArrowBufPointerBenchmarks.compareBenchmark N/A avgt 15 50.731 ± 4.514 ns/op o.a.a.vector.BitVectorHelperBenchmarks.loadValidityBufferAllOne N/A avgt 15326.927 ± 3.719 ns/op o.a.a.vector.DecimalVectorBenchmarks.setBigEndianArrowBufBenchmark N/A avgt 15 4.807 ± 0.050 us/op o.a.a.vector.IntBenchmarks.setIntDirectly N/A avgt 15 1.681 ± 0.003 us/op o.a.a.vector.IntBenchmarks.setWithValueHolder N/A avgt 15 1.865 ± 0.017 us/op o.a.a.vector.IntBenchmarks.setWithWriter N/A avgt 15 1.657 ± 0.012 us/op o.a.a.vector.dictionary.DictionaryEncoderBenchmarks.testEncode N/A avgt 15 27574.581 ± 215.370 ns/op ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696381588 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696065926 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696466268 > I'm clearly missing something. Why doesn't item 2 when directly in the vector solve the same purpose as 1/3? Sorry, I didn't realize that the ArrowBuf was that restricted too; I was focusing on the usage of `lastValueCapacity` and how it got out of sync. I'll try to cache the calculated capacity inside the class itself and handle its invalidation, which should bring this back in line with option 2. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696456877 *Option 2 being the best case of the append-only builder style interface; something like IntWriter, where direct access to the buffer was not permissible, and so its safe to do capacity checks like that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696455205 Sorry, did you mean the specialized append interface (as Option 2), that assumes buffer ownership? I mislabeled the options in the paragraph you quoted (now corrected). I'm not familiar with the code, but I don't think it's safe to implement, if users can access the underlying buffers via `getDataBuffer()`? Option 2 was proposed as the (final) potentially desired interface that wasn't possible without more significant changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696452697 @lidavidm @liyafan82 @jacques-n Interpreting the results: This patch could be improved (performance wise) by more aggressive caching (option 3), at the potential expense of additional state, for an additional 35% speedup above this submitted patch. @liyafan82 @lidavidm I was wrong about the performance gain - I guess this speedup might be worth having? Please let me know if you prefer this variant, subject to the given tradeoff. I'll clean it up and resubmit if so. An append-only interface, trivially implemented, but assuming buffer ownership (Option 3), could get (up to) an additional 54% above the improved version of this patch. A type specialized variant, presumably created by templating out the various types (Option 4), could get (up to, these are very very rough results, it hasn't been verified where this speedup is coming from, it might just be the bit setting code) 18% above the specialized append-only interface. It might be possible to dive into the ASM to figure out what's actually being generated, but that might be a tad extreme for now (and unstable). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696450195 Here are the results of my testing. I'm not really that familiar with Arrow, and some of the code is sloppy, so please check that what I'm doing matches up with your expectations! I didn't check whether the benchmarks were really measuring something realistic either (although I have no specific reason to doubt that). All benchmarks were compiled and run on the repo-provided java-14-openjdk-amd64 on Ubuntu 20.04. I've disabled HT, CPU freq scaling, but did not pin CPU cores. Representative CLI invocation: ``` java -Darrow.enable_unsafe_memory_access=true -jar ../arrow/java/performance/target/benchmarks.jar '.*IntBenchmarks.setIntDirectly' -w 2 -r 2 -f 2` # JMH version: 1.21 # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1 # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java # VM options: -Darrow.enable_unsafe_memory_access=true # Warmup: 5 iterations, 2 s each # Measurement: 5 iterations, 2 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Average time, time/op # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly ``` * Option 1: the original IntBenchmarks * Option 2: save the value/validity buffer capacities in a wrapper around IntVector, and then manually resize, calling .set * Option 3: save the overall value capacity in BaseFixedWidthVector.getValueCapacity, and force recomputation based on valueBuffer.capacity() and validityBuffer.capacity() (what I understand of @liyafan82 's suggestion that avoids invalidation issues around the underlying buffers) * Option 4: Call the despecialized putLong directly (The code that was linked in Iceberg, with more code to handle validity buffer) master w/o this patch: `IntBenchmarks.setIntDirectly avgt 10 11.605 ± 0.433 us/op` Option 1: `IntBenchmarks.setIntDirectly avgt 10 4.670 ± 0.095 us/op` Option 2: `IntBenchmarks.setIntDirectlyWithSpecializedWriterCachedCapacities avgt 10 2.228 ± 0.073 us/op` Option 3: `IntBenchmarks.setIntDirectly avgt 10 3.446 ± 0.022 us/op` Option 4: `IntBenchmarks.setIntDirectlyDirectlyWithPutInt avgt 10 1.887 ± 0.009 us/op` Code for Option 1: This patch, with the master benchmark code, as extracted: ``` +@State(Scope.Benchmark) +public class IntBenchmarks { + + private static final int VECTOR_LENGTH = 1024; + + private static final int ALLOCATOR_CAPACITY = 1024 * 1024; + + private BufferAllocator allocator; + + private IntVector vector; + + @Setup + public void prepare() { +allocator = new RootAllocator(ALLOCATOR_CAPACITY); +vector = new IntVector("vector", allocator); +vector.allocateNew(VECTOR_LENGTH); +vector.setValueCount(VECTOR_LENGTH); + } ... + @Benchmark + @BenchmarkMode(Mode.AverageTime) + @OutputTimeUnit(TimeUnit.MICROSECONDS) + public void setIntDirectly() { +for (int i = 0; i < VECTOR_LENGTH; i++) { + vector.setSafe(i, i % 3 == 0 ? 0 : 1, i); +} + } ``` Code for Option 2: ``` diff --git a/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java b/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java new file mode 100644 index 0..c4201dea8 --- /dev/null +++ b/java/performance/src/main/java/org/apache/arrow/vector/CapacityCachingIntVector.java @@ -0,0 +1,27 @@ +package org.apache.arrow.vector; + +public class CapacityCachingIntVector { +private final IntVector backingVector; +private long backingVectorCachedCapacity; + +public CapacityCachingIntVector(IntVector backing) { +this.backingVector = backing; +synchronizeCapacity(); +} + +public void setSafe(int index, int isSet, int value) { +if (index >= backingVectorCachedCapacity) { +// unlikely +this.backingVector.setSafe(index, isSet, value); // force a resize +synchronizeCapacity(); +return; +} +// likely +this.backingVector.set(index, isSet, value); +return; +} + +private void synchronizeCapacity() { +backingVectorCachedCapacity = backingVector.getValueCapacity(); +} +} ``` Code for Option 3: ``` @@ -182,11 +187,19 @@ public abstract class BaseFixedWidthVector extends BaseValueVector */ @Override public int getValueCapacity() { -return Math.min(getValueBufferValueCapacity(), getValidityBufferValueCapacity()); +final long valueBufferCapacity = valueBuffer.capacity(); +final long validityBufferCapacity =
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696381588 @jacques-n I haven't done very much investigation on other speedups - I just happened to notice performance irregularities as compared to our other (legacy) codepaths, and realized that we were spending a significant number of cycles dividing integers. If you're referring to the linked code, I have nothing to do with the Iceberg project. I just happened to chance upon this code. I haven't investigated further patterns, although I'd want to do that in the future, time permitting. I think it would be possible to measure the upper bound of performance, and address some of the questions (including that) in this discussion, if we did the following benchmarks side-by-side over the given variants: * the original IntBenchmarks * save the value/validity buffer capacities outside arrow code, and then manually resize, calling .set * save the overall value capacity in BaseFixedWidthVector.getValueCapacity, and force recomputation based on valueBuffer.capacity()/validityBuffer.capacity() (what I understand of @liyafan82 's suggestion that avoids invalidation issues around the underlying buffers) * Call the despecialized putLong directly (The code that was linked in Iceberg) The second option should highlight the upper bound of performance reachable by aggressively caching. The last option should highlight the upper bound of performance reachable by de-specialization (and therefore removing the virtual call chain). I'll write and run these later today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696065926 Yep, I'm worried that the buffers can be changed unintentionally/intentionally, and as you pointed out, needs to be always checked for and trapped. I'm wondering if the overhead of these checks are just inherent in such a flexible API, and quite significant (for ArrowBuf without flipping the `arrow.enable_unsafe_memory_access=false` flag globally, it does consume ~30% of CPU when `setSafe`ing), and worth removing for the general use-case. (where this by a Builder API that defers generalization, or a special (local) ArrowBuf mode if possible). We could jump into the unsafe APIs, like [this](https://github.com/apache/iceberg/issues/9#issuecomment-517486998) [code](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java#L107), but a more generic out-of-the-box solution that makes the happy path for serialization easy would be nice. Ideally, we'd remove at least ~70% of the CPU cost of calling `setSafe` before this patch, for users that only want to append data, with an eye towards pushing it down even more given the simpler code. I'm less familiar with the Arrow code than I'd like to be though (right now), and I'm not sure if there are other ways around this that don't involve doing `arrow.enable_unsafe_memory_access=true`, and various other tweaks. There is inherent value in keeping only one or two (high level) interfaces for building Arrow buffers, and optimizing for the common use case. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-695122375 > I am thinking about another way: can we simply save the value/validity buffer capacities? so we can further improve the performance by avoiding the if branch in getValueBufferValueCapacity and the if branch in capAtMaxInt Let me check again on this; I might be wrong about the performance differences and safety if we aggressively cache; it was something I gave less thought to because of our specific use case (`arrow.enable_unsafe_memory_access=false`), but it might be more beneficial to do in general. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-695114615 @liyafan82 Thanks for taking a look! > I am a little surprised that JIT failed to transform the integer divistion to a right shift, as it can be easily deduced that the type width for an IntVector is always 4. I was pretty surprised too - it turned out that a huge chunk of cycles that we spent serializing data was devoted to this (BaseFixedWidthVector capacity bounds checking) + ArrowBuf bounds checking. This held true across the two/three JDK versions I used, across servers. It was not immediately evident where all the cycles are going on a Java-level profiler, but if you pull out the assembly, I think its quite clear (see the linked JIRA). > I am thinking about another way: can we simply save the value/validity buffer capacities? so we can further improve the performance by avoiding the if branch in getValueBufferValueCapacity and the if branch in capAtMaxInt I did this for a prototype internally - I think this is faster, but if done before even calling setSafe, implicitly assumes that no-one else is writing to the ArrowBuf. If I remember correctly, if we read the capacities directly and do these checks in getValueBufferValueCapacity, deeper in the setSafe call, performance was marginally comparable (at least within <20%?). However, this needs to be checked again. I wasn't too concerned about this, as we run with arrow.enable_unsafe_memory_access=false` set. We might have been able to make it faster on this front of checking the capacity, but its wy into diminishing returns (for us at least). We also lost a lot of performance over keeping `arrow.enable_unsafe_memory_access=false`, and I was wondering if there was a better way of eliminating all bounds checks, both on the BaseFixedWidthVector, and in ArrowBuf.chk. When we run our production code with `arrow.enable_unsafe_memory_access=false`, the breakdowns of the setSafe CPU usage (before this patch) according to an async sampling profiler are as following: 53% for handleSafe (which this patch is expected to remove the majority of) ~30% for the ArrowBuf.chk checks in the `set(index, value)` code. I was thinking that if we had a specialized vector append API, that guaranteed ownership of the buffer, we could throw out all of these bounds checks safely (in handleSafe, and ArrowBuf.chk), as well as have a restricted API to optimize building buffers out of. Something like the original [IntWriter](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/complex/writer/BaseWriter.ScalarWriter.html) interface, but with stronger ownership semantics around the ArrowBuf? Is that something you think is possible/desirable? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] josiahyan commented on pull request #8214: ARROW-9965: [Java] Improve performance of BaseFixedWidthVector.setSafe by optimizing capacity calculations
josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-694578807 @lidavidm This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org