Pulkitg64 commented on PR #15549: URL: https://github.com/apache/lucene/pull/15549#issuecomment-3842988189
I think I found the problem. I was running these benchmarks on m5.12xlarge machines, and this instance type doesn't support float16 intrinsic operations. So I switched to m7g.8xlarge machines, and here are the results:

I am seeing much better performance with float16 encoding now. The latency with float16 is still about 50% higher than with float32 (3.229 ms vs. 2.111 ms). I think this is expected because of the extra conversion between float16 and float32. I also haven't implemented bulk scoring yet, so that may recover some of the latency. The indexing rate improved by a bit over 10% (5879.93 vs. 5214.04 docs/s); this may be because the smaller vectors are faster to fetch.

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) |
|----------|--------|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|
| float16  | 0.992  | 3.229       | 3.154  | 0.977       | 6820    | 17.01    | 5879.93      | 0.01           | 207.65         |
| float32  | 0.990  | 2.111       | 2.066  | 0.978       | 6858    | 19.18    | 5214.04      | 22.81          | 403.03         |

* Profiler for float16:
```
40.69%  82592   jdk.incubator.vector.Float16Vector#reduceLanesTemplate() [Inlined code]
20.50%  41612   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
5.41%   10983   jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
5.00%   10158   org.apache.lucene.index.Float16VectorValues$1#vectorValue() [Inlined code]
3.92%   7964    jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
2.21%   4488    jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
1.71%   3467    org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.43%   2909    org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
1.19%   2408    org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.18%   2386    org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.87%   1763    java.lang.invoke.VarHandleSegmentAsInts#get() [Inlined code]
0.85%   1722    org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
0.75%   1514    org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.63%   1278    org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
0.62%   1251    jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
0.61%   1247    org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [JIT compiled code]
0.47%   961     java.util.ArrayList#elementData() [Inlined code]
0.47%   951     org.apache.lucene.util.hnsw.NeighborArray#size() [Inlined code]
0.45%   904     sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.44%   894     org.apache.lucene.util.FixedBitSet#getAndSet() [JIT compiled code]
0.40%   813     org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.36%   730     sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.35%   710     org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$Float16ScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.34%   699     org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.34%   689     org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor() [Inlined code]
0.33%   677     jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.31%   623     sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.30%   616     org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct() [Inlined code]
0.26%   521     jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.24%   495     org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
```

* Profiler for float32:
```
63.09%  125971  jdk.incubator.vector.FloatVector#reduceLanesTemplate() [Inlined code]
5.72%   11426   jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
3.86%   7714    org.apache.lucene.index.FloatVectorValues$1#vectorValue() [Inlined code]
2.97%   5930    org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.58%   3155    org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
1.35%   2691    org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.28%   2565    org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
1.26%   2515    jdk.incubator.vector.FloatVector#fromArray0Template() [Inlined code]
1.25%   2500    org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
1.16%   2326    org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
1.11%   2212    org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [Inlined code]
1.08%   2164    jdk.incubator.vector.FloatVector#lanewiseTemplate() [Inlined code]
1.02%   2029    org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
0.69%   1381    jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.58%   1165    org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.54%   1075    sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.53%   1067    sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.49%   985     org.apache.lucene.util.VectorUtil#normalizeToUnitInterval() [Inlined code]
0.45%   902     org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.43%   858     org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.41%   811     org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.39%   775     org.apache.lucene.util.GroupVIntUtil#readGroupVInt() [Inlined code]
0.38%   749     org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get() [JIT compiled code]
0.37%   739     sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.36%   716     org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
0.28%   556     jdk.incubator.vector.FloatVector#fromMemorySegment() [Inlined code]
0.24%   479     org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek() [Inlined code]
0.24%   475     java.util.concurrent.locks.ReentrantReadWriteLock#readLock() [Inlined code]
0.21%   427     java.util.ArrayList#elementData() [Inlined code]
0.20%   402     java.util.concurrent.locks.AbstractQueuedLongSynchronizer#compareAndSetState() [Inlined code]
```

#### Next Steps:
Understand the flame chart and try to further improve the float16 encoding benchmark runs.
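To make the "extra conversion" cost concrete, here is a minimal scalar sketch of a dot product over float16 values stored as raw `short` bits, where every element is widened to float32 before the multiply. This is only an illustration, not the code in this PR: the class `Float16DotSketch` and its methods are hypothetical names, and the decoder is written by hand so the example runs on any JDK (recent JDKs also provide `Float.float16ToFloat` for the same purpose).

```java
public class Float16DotSketch {

    /** Decodes IEEE 754 binary16 bits (1 sign, 5 exponent, 10 mantissa) into a float. */
    public static float halfToFloat(short h) {
        int bits = h & 0xFFFF;
        int sign = bits >>> 15;
        int exp = (bits >>> 10) & 0x1F;
        int mant = bits & 0x3FF;
        float magnitude;
        if (exp == 0) {
            // Zero or subnormal: mant * 2^-24
            magnitude = Math.scalb((float) mant, -24);
        } else if (exp == 0x1F) {
            // All-ones exponent: infinity when mantissa is zero, otherwise NaN
            magnitude = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            // Normal: (1 + mant/1024) * 2^(exp-15) == (1024 + mant) * 2^(exp-25)
            magnitude = Math.scalb((float) (1024 + mant), exp - 25);
        }
        return sign == 0 ? magnitude : -magnitude;
    }

    /** Scalar dot product: two widening conversions per multiply-add. */
    public static float dotProduct(short[] a, short[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += halfToFloat(a[i]) * halfToFloat(b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        short[] a = {0x3C00, 0x4000, (short) 0xC000}; // 1.0, 2.0, -2.0
        short[] b = {0x3800, 0x3C00, 0x3C00};         // 0.5, 1.0, 1.0
        // 1.0*0.5 + 2.0*1.0 + (-2.0)*1.0 = 0.5
        System.out.println(dotProduct(a, b));
    }
}
```

In the float32 path the two `halfToFloat` calls per element simply don't exist, which is the overhead that hardware float16 support is meant to remove or reduce.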
