Hi Lucene Devs,

Looking for feedback and reviews on https://github.com/apache/lucene/pull/14178, which adds a Faiss-based codec for vector searches.

Faiss (https://github.com/facebookresearch/faiss) is *"a library for efficient similarity search and clustering of dense vectors"*. It supports various features such as vector transforms (e.g. PCA), indexing algorithms (e.g. IVF, HNSW), quantization techniques (e.g. PQ), search strategies (e.g. 2-step refinement), and different hardware (including GPUs), among others. This codec adds implicit support for *most* of these features (basically anything that can be used from its flexible index factory <https://github.com/facebookresearch/faiss/wiki/The-index-factory>).

The new codec calls the C API <https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md> of Faiss (a shared library intended for FFI) using Project Panama (https://openjdk.org/projects/panama). The codec lives in the sandbox module and does *not* add a dependency on Faiss (no need to download / build Faiss if you're not going to use it!); it only relies on shared libraries being present at runtime. The PR also adds a CI step <https://github.com/apache/lucene/actions/runs/14215130397/job/39829996210?pr=14178> to regularly test it. We also made some changes <https://github.com/apache/lucene/pull/14178#issuecomment-2771809487> to upstream Faiss to be able to support this codec!

A benchmark comparison <https://github.com/apache/lucene/pull/14178#issuecomment-2715605453> between the default Lucene-based search and this new codec (for an unfiltered HNSW search) shows a ~20% speedup in search time and ~15% in indexing time (more benchmarks on the PR, like a filtered search <https://github.com/apache/lucene/pull/14178#discussion_r1941904975>). IMO, though, the main value of the codec lies in the ability to use the different indexing and search algorithms already implemented in Faiss (plus support for future ones) without having to redo them in Java / Lucene.

Thank you for your time!
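P.S. For anyone who has not used the Faiss index factory: a single comma-separated description string selects the vector transform, the index type, and the encoding, which is why one codec can cover many algorithms. Below is a minimal sketch of how such strings compose. The class `FaissDescriptionSketch` and its `describe` helper are hypothetical names made up for illustration; they are not part of Faiss or of the PR, and how the codec actually accepts the description is defined in the PR itself, not here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative helper only: neither Faiss nor the PR defines this class.
final class FaissDescriptionSketch {

    // Joins an optional vector transform, a required index type, and an
    // optional encoding into the comma-separated form the index factory parses.
    static String describe(String transform, String index, String encoding) {
        Objects.requireNonNull(index, "an index type is required");
        List<String> parts = new ArrayList<>();
        if (transform != null && !transform.isEmpty()) {
            parts.add(transform);
        }
        parts.add(index);
        if (encoding != null && !encoding.isEmpty()) {
            parts.add(encoding);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // HNSW graph with 32 links per node
        System.out.println(describe(null, "HNSW32", null));
        // IVF with 1024 lists, vectors compressed with 16-byte PQ codes
        System.out.println(describe(null, "IVF1024", "PQ16"));
        // Same, preceded by a PCA reduction to 64 dimensions
        System.out.println(describe("PCA64", "IVF1024", "PQ16"));
    }
}
```

Switching from HNSW to IVF with product quantization is then a change to one string rather than a new Java implementation, which is the point made above about reusing what Faiss already provides.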