Q: Are there licensing gotchas with approach 1 (which otherwise sounds
nicer from a maintenance standpoint)? We need to be sure that everything
we use is Apache-okay in terms of licensing. It would be fun to see
some preliminary numbers on perf, e.g., for KNN, each way, were it as
easy as changing which function(s) to call... :-) That would help
quantify the two options (vs. each other and vs. none) too.
On 6/10/25 7:24 AM, Calvin Dani wrote:
Hi,
As part of adding vector functionality to AsterixDB, I have been exploring
possible optimizations for vector computations. One promising direction is
leveraging SIMD operations to accelerate these calculations. Although the
HotSpot JIT compiler can autovectorize simple loops to use SIMD, this
generally requires the loop body to be branchless (i.e., no conditional
branching like if/else), and it may not be triggered once the vector
calculations get complex.
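For illustration, a branch-free scalar loop of the kind the JIT can at
least consider for autovectorization might look like the sketch below.
The class and method names are hypothetical, not existing AsterixDB code,
and whether the loop is actually vectorized depends on the JDK version
and the hardware (floating-point reductions in particular are not always
vectorized).

```java
// Hypothetical sketch: names are illustrative, not AsterixDB code.
final class ScalarDistanceSketch {

    // Squared Euclidean distance with a branch-free loop body.
    // The only conditional is the length check, which sits outside the loop.
    static float squaredEuclidean(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

The squared distance is enough for ranking nearest neighbours; take
Math.sqrt of the result only when the true distance is needed.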
I have considered two main options for SIMD-enabled vector computation:
1. Java Vector API: First shipped as an incubating feature in Java 16
(JEP 338), the Vector API comes out of Project Panama. It remains in
incubation and, according to its JEPs, will stay there until the Project
Valhalla features it depends on become available. The API already supports
the basic operations needed for our distance metrics, such as Euclidean
Distance, Manhattan Distance, Cosine Similarity, and Dot Product. Its
Vector<E> types (e.g., FloatVector) are loaded from and stored to arrays or
memory segments, so embeddings would still be stored as float[] (or
similar) and only loaded into vectors for the computation.
2. Foreign Function & Memory API: Finalized in Java 22 (JEP 454), this
allows calling optimized C/C++ libraries directly from Java. We could
either leverage existing highly-optimized vector computation libraries or
implement our own native code. However, packaging native libraries and
ensuring their compatibility across different target platforms may
introduce complexity.
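To make option 1 concrete, here is a minimal sketch of a dot product
written against the incubating Vector API. The class and method names are
hypothetical; the FloatVector, VectorSpecies, and VectorOperators calls
are from jdk.incubator.vector, so the code has to be compiled and run
with --add-modules jdk.incubator.vector, and the API may still change
between JDK releases.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: names are illustrative, not AsterixDB code.
final class SimdDistanceSketch {

    // Widest vector shape the current CPU supports for float lanes.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        // Largest multiple of the lane count that fits in the array.
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // acc = va * vb + acc, lane by lane
        }
        // Collapse the per-lane partial sums into one scalar.
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Because the lanes are summed in a different order than in a scalar loop,
results can differ from the scalar version in the last few bits for
general floating-point inputs.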
If you are aware of other solutions or have feedback on these options, I
would appreciate your insights.
Thank you,
Calvin Dani