Re: Vector Computation Optimization Approaches for AsterixDB

Calvin Dani Mon, 28 Jul 2025 12:31:40 -0700

Hi,

I've been reviewing technologies that rely on external libraries or
engines to offload intensive operations (Apache Gluten, Velox, DataFusion
Comet). A recurring pattern is the use of JNI, with systems like Spark
incorporating designs to ensure memory is managed properly when
interfacing with these libraries.

I plan to dive deeper into each implementation to understand how memory is
managed at that layer.

Project Panama introduces several promising features that simplify memory
management and reduce the verbosity of JNI calls. Its Foreign Function &
Memory API was still a preview feature in Java 21 and was finalized in
Java 22; the Vector API, however, is still incubating.
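
For illustration, here is a minimal sketch of what the Foreign Function &
Memory API looks like in practice. It assumes JDK 22 or later (where the API
is final); the class and method names are hypothetical and are not AsterixDB
code. It shows the two points above: off-heap memory whose lifetime is tied
to an Arena, and a native call (libc's strlen) without any hand-written JNI
glue.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical demo class; assumes JDK 22+ and a 64-bit platform.
final class FfmSketch {

    // Copies an embedding into off-heap memory and sums it from there.
    // The native segment is released when the arena is closed.
    static float sumOffHeap(float[] embedding) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom(ValueLayout.JAVA_FLOAT, embedding);
            float sum = 0f;
            for (long i = 0; i < embedding.length; i++) {
                sum += seg.getAtIndex(ValueLayout.JAVA_FLOAT, i);
            }
            return sum;
        }
    }

    // Calls the C library's strlen through a downcall handle.
    static long nativeStrlen(String s) {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                // size_t is modelled as a Java long here (64-bit assumption).
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Allocates a NUL-terminated UTF-8 copy of the string off-heap.
            MemorySegment cString = arena.allocateFrom(s);
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }
}
```

A real integration would bind a vector library's distance functions instead
of strlen, but the arena-scoped allocation pattern would stay the same.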

Interestingly, Apache Lucene has already started leveraging Project Panama,
specifically for foreign function invocation (e.g., madvise) and the Vector
API. Despite the Vector API's incubating status, Lucene manages JDK
version-specific APIs by building against "apijars." Here’s a relevant quote
from an Elastic article:

> “The JDK Vector API, being developed in Project Panama, has been
> incubating for quite a while now. The incubating status is not a reflection
> of its quality, but more a consequence of a dependency on other exciting
> work happening in OpenJDK, namely value types. Lucene has a novel way of
> leveraging non-final APIs in the JDK — by building against an ‘apijar’
> containing the JDK-version specific APIs. This is a pragmatic approach that
> we don’t take lightly. Lucene still has the scalar variants of these
> low-level primitive operations. The version of the implementation is
> selectable at startup.”

Full article link: Accelerating vector search with SIMD instructions –
Elastic
<https://www.elastic.co/blog/accelerating-vector-search-simd-instructions>
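
As a rough illustration of the "scalar variant, selectable at startup" idea
in the quote, here is a minimal sketch. All names (DistanceKernels,
KernelSelector, the demo.vector.impl property, and the demo.PanamaKernels
class) are hypothetical; this is not Lucene's or AsterixDB's actual
mechanism. The scalar kernels are plain branch-free loops, and the selector
only attempts a Vector API implementation if the incubator module is present.

```java
import java.util.Locale;

// Hypothetical kernel interface; not a Lucene or AsterixDB API.
interface DistanceKernels {
    float dotProduct(float[] a, float[] b);
    float squaredEuclidean(float[] a, float[] b);
    String name();
}

// Always-available scalar fallback. The loops contain no branches in the
// body, which keeps them candidates for JIT autovectorization.
final class ScalarKernels implements DistanceKernels {
    @Override
    public float dotProduct(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    @Override
    public float squaredEuclidean(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    @Override
    public String name() {
        return "scalar";
    }

    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
    }
}

final class KernelSelector {
    // Hypothetical system property: "auto" (default) or "scalar".
    static final String IMPL_PROPERTY = "demo.vector.impl";
    // Hypothetical class that would be compiled against the Vector API.
    static final String SIMD_CLASS = "demo.PanamaKernels";

    // Picks the implementation once, at startup.
    static DistanceKernels select() {
        String requested =
                System.getProperty(IMPL_PROPERTY, "auto").toLowerCase(Locale.ROOT);
        boolean vectorModule =
                ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
        if (!requested.equals("scalar") && vectorModule) {
            try {
                return (DistanceKernels) Class.forName(SIMD_CLASS)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // SIMD variant missing or incompatible: use the scalar one.
            }
        }
        return new ScalarKernels();
    }
}
```

Callers would hold a single DistanceKernels instance obtained from
KernelSelector.select(), so swapping implementations is one call at startup.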

These are just some initial thoughts on the viability of Project Panama.
I'm curious to hear what others think as well.

Best regards,
Calvin Dani

On Fri, Jun 13, 2025 at 7:53 AM Mike Carey <dtab...@gmail.com> wrote:

> That reminds me - once upon a time, plan
> serialization/distribution/deserialization using Java serialization was
> kind of an expensive part of our path, when we were trying to shave off
> costs for little queries.  I wonder if we should look at that again
> sometime?  (Not our most urgent problem, this just reminded me.)
>
> Cheers,
>
> Mike
>
> On 6/13/25 3:30 AM, Wail Alkowaileet wrote:
> > Quoting Photon
> > <https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf>
> > Paper:
> >> After query planning, DBR launches tasks to execute the stages of the
> >> plan. In a task with Photon, the Photon execution node first serializes
> >> the Photon part of the plan into a Protobuf [6] message. This message is
> >> passed via the Java Native Interface (JNI) [8] to the Photon C++ library,
> >> which deserializes the Protobuf and converts it into a Photon-internal
> >> plan.
> >
> > Let's see what others have done. E.g., Photon, Velox (+ Apache Gluten to
> > use Velox in Spark), Apache DataFusion Comet (Apache DataFusion is written
> > in Rust).
> >
> > On Wed, Jun 11, 2025 at 1:55 AM Calvin Dani<calvinthomas.d...@gmail.com>
> > wrote:
> >
> >> Yes, I'll look into the JNA project too and explore approach 2 with both
> >> FFM and JNA.
> >>
> >> I’ll prototype both approach 1 and 2 and post a status update here.
> >>
> >>> On Jun 10, 2025, at 1:50 PM, Ian Maxon<ima...@apache.org> wrote:
> >>>
> >>> The Vector API is in OpenJDK, so I think the licensing should be OK:
> >>> https://openjdk.org/jeps/508
> >>>
> >>> The main problem is the fact it isn't a stable API yet, and it relies
> >>> on Valhalla. It would be a judgement call on how much we expect it to
> >>> change over time, and how difficult it would be to migrate things to
> >>> follow those changes. It would also be a bet that by the time
> >>> everything is done, this set of JDK features is more or less
> >>> stabilized.
> >>>
> >>> Using FFI/JNI would be a more traditional way to go about it. FFI is
> >>> new and better than JNI, so if we choose to go with that, it should be
> >>> less painful. FFI is a preview feature, which is less risky than an
> >>> incubating feature.
> >>>
> >>> There is also the JNA project, which wraps JNI to make it simpler:
> >>> https://github.com/java-native-access/jna . I'm assuming most of the
> >>> libraries we might want to use are mostly computational, so they
> >>> wouldn't have many platform-specific dependencies, just architecture
> >>> specific ones. I think it also handles the build aspect of it, which
> >>> FFI doesn't directly. Assuming the libraries we would want to use
> >>> aren't in libc or otherwise can't be assumed to be present, we would
> >>> have to include them in the jar somehow.
> >>>
> >>>
> >>>> On Tue, Jun 10, 2025 at 8:27 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>>
> >>>> Q:  Are there licensing gotchas with approach 1 (which otherwise sounds
> >>>> nicer from a maintenance standpoint)? We need to be sure that everything
> >>>> we use is Apache-okay in terms of licensing.  It would be fun to see
> >>>> some preliminary numbers on perf, e.g., for KNN, each way, were it as
> >>>> easy as changing which function(s) to call...  :-)  That would help
> >>>> quantify the two options (vs. each other and vs. none) too.
> >>>>
> >>>>> On 6/10/25 7:24 AM, Calvin Dani wrote:
> >>>>> Hi,
> >>>>>
> >>>>> As part of adding vector functionality to AsterixDB, I have been
> >>>>> exploring possible optimizations for vector computations. One promising
> >>>>> direction is leveraging SIMD operations to accelerate these
> >>>>> calculations. Although Java offers autovectorization to utilize SIMD,
> >>>>> this approach requires the operations to be branchless (i.e., no
> >>>>> conditional branching like if/else), and it may not always be triggered
> >>>>> when vector calculations get complex.
> >>>>>
> >>>>> I have considered two main options for SIMD-enabled vector computation:
> >>>>>
> >>>>> 1. Java Vector API: First shipped as an incubating feature in Java 16,
> >>>>> the Vector API is developed under Project Panama and depends on the
> >>>>> long-term Project Valhalla. While it remains in incubation and likely
> >>>>> won’t be finalized until Project Valhalla completes, the API already
> >>>>> supports the basic operations needed for our distance metrics, such as
> >>>>> Euclidean Distance, Manhattan Distance, Cosine Similarity, and Dot
> >>>>> Product. It also provides a primitive Vector<E> type which could serve
> >>>>> as native storage for embeddings.
> >>>>>
> >>>>> 2. Foreign Function & Memory API: This allows calling optimized C/C++
> >>>>> libraries directly from Java. We could either leverage existing
> >>>>> highly-optimized vector computation libraries or implement our own
> >>>>> native code. However, packaging and ensuring compatibility of native
> >>>>> libraries across different target platforms may introduce complexity.
> >>>>>
> >>>>> If you are aware of other solutions or have feedback on these options,
> >>>>> I would appreciate your insights.
> >>>>>
> >>>>> Thank you,
> >>>>> Calvin Dani
> >>>>>
> >
