+1 to this APE. Supporting ANN queries using a vector index, especially a novel one like this, is awesome. I think there are still some loose ends around the distance functions and the exact form of the WITH clauses, but these are minor details and I don't think they need to block acceptance.

On Wed, Jan 14, 2026 at 11:57 AM Taewoo Kim <[email protected]> wrote:

> Thanks for the clarification!
>
> Best,
> Taewoo
>
> On Wed, Jan 14, 2026 at 11:36 AM Shiva Jahangiri <[email protected]> wrote:
>
> > Hi Taewoo,
> >
> > Thanks! So each data partition will have its own vector index as
> > secondary index, and so each data partition does its own sampling of
> > data, creates its own static structure, etc. using its own data. That
> > means there is no overlapping or connection between the data or static
> > structure of vector indexes of different data partitions.
> >
> > For top-k, we basically ask each data partition to give us their top-k
> > results. If we have N data partitions, we will get N*k results which
> > then in a global step we get its top-k results out.
> >
> > Each node can have multiple data partitions which makes it simple in
> > the shared-nothing architecture. In the cloud mode, the code is
> > modified to make sure that the search goes through all data partitions
> > even if multiple of them are managed by a single compute node.
> >
> > Best,
> > Shiva
> >
> > On Wed, Jan 14, 2026 at 10:39 AM Taewoo Kim <[email protected]> wrote:
> >
> > > Hi Shiva,
> > >
> > > Thanks for your reply.
> > >
> > > Somehow I got confused about TOP-K. I thought each partition could
> > > have an overlapping portion from the static part. So, will each
> > > partition be processed on a single node?
> > >
> > > Regarding the memory part, I'm glad to know that the size is not
> > > that huge. :-)
> > >
> > > Best,
> > > Taewoo
> > >
> > > On Wed, Jan 14, 2026 at 9:25 AM Shiva Jahangiri <[email protected]> wrote:
> > >
> > > > Hi Taewoo,
> > > >
> > > > Thanks for the great questions.
> > > >
> > > > With regard to the distributed top-k, each data partition will
> > > > return its top-k result (if it has at least k records) and then we
> > > > get the global top-k based on these local ones (somewhat similar
> > > > to group by).
> > > >
> > > > With regard to the memory usage, the static part has to remain in
> > > > the memory and our experiments have showed that its size is not
> > > > that large compared to the size of the data (for 18GB of data
> > > > stored in one data partition,1 Million records each with an
> > > > embedding with the size of 960 dimensions the static part takes
> > > > 11MB of memory). The reason that our index is not memory hungry is
> > > > that we only have embeddings in the static part, the data pages in
> > > > the dynamic part where the records will be inserted does not store
> > > > the embedding of the record, instead it stores its distance to the
> > > > cluster’s centroid. We will later on explore storing the quantized
> > > > vectors for each record (helps reducing execution time by sending
> > > > lesser records to the primary index for distance calculations) and
> > > > that might change the size of the dynamic section. It is important
> > > > to note that each time a new memory component is created the
> > > > static structure is copied into the memory component and the
> > > > dynamic part will be filled with the incoming data.
> > > >
> > > > Best,
> > > > Shiva
> > > >
> > > > Shiva Jahangiri
> > > > Assistant Professor in Computer Science and Engineering Department
> > > > Santa Clara University
> > > >
> > > > On Tue, Jan 13, 2026 at 3:24 PM Taewoo Kim <[email protected]> wrote:
> > > >
> > > > > Hi Shiva,
> > > > >
> > > > > This proposal looks good.
> > > > >
> > > > > I have two questions (sorry if I missed)
> > > > >
> > > > > How are we going to handle distributed execution when dealing
> > > > > with top-K ANN?
> > > > >
> > > > > How does the memory component look like in terms of configurable
> > > > > size? My naive understanding is that Vector index itself is
> > > > > memory-hungry.
> > > > >
> > > > > Best,
> > > > > Taewoo
> > > > >
> > > > > On Tue, Jan 13, 2026 at 2:04 PM Shiva Jahangiri <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Initiating discussion to add vector index in AsterixDB to
> > > > > > support approximate nearest neighbor (ANN) queries.
> > > > > >
> > > > > > Feature: Adding Vector Index
> > > > > >
> > > > > > Details: Currently AsterixDB does not support approximate
> > > > > > nearest queries and similarity search on vector embeddings.
> > > > > > This proposal suggests the first design of a tree-based vector
> > > > > > indexing supporting top-k ANN queries which is fully
> > > > > > compatible with LSM structure of AsterixDB's storage. As part
> > > > > > of this proposal we provide support for :
> > > > > >
> > > > > > * Adding vector distance functions to support K-Nearest
> > > > > >   Neighbor (KNN) queries
> > > > > > * Adding vector index to support ANN queries
> > > > > > * Adding support for INCLUDE fields in vector index to better
> > > > > >   support filtered similarity search.
> > > > > >
> > > > > > APE:
> > > > > >
> > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*31*3A*Vector*Index__;KyUrKw!!MLMg-p0Z!D1Zmu-mN_byA8sV_P_p7aLlbYJIg0b19njsPeaMkVTovtWeW0IsD-CIgOo0MJ7_7t3pZsk63GqP6lfK1$
> > > > > >
> > > > > > Thanks,
> > > > > > Shiva
> > > > > >
> > > > > > --
> > > > > > Shiva Jahangiri
> > > > > > Assistant Professor in Computer Science and Engineering Department
> > > > > > Santa Clara University
> >
> > --
> > Shiva Jahangiri
> > Assistant Professor in Computer Science and Engineering Department
> > Santa Clara University
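
For anyone skimming the thread, the distributed top-k step described above (each data partition returns its own top-k, then a global step picks the final top-k out of at most N*k candidates) can be sketched roughly as follows. This is only an illustration in Python, not AsterixDB code: the function names, the (distance, record id) tuple layout, and the sample data are all made up for the example, and smaller distance is assumed to mean a closer match.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple

# Hypothetical result layout: (distance to the query vector, record id).
Hit = Tuple[float, str]


def local_top_k(hits: Iterable[Hit], k: int) -> List[Hit]:
    """Return the k closest hits of one partition (fewer if it holds fewer)."""
    return heapq.nsmallest(k, hits)


def global_top_k(per_partition: Sequence[List[Hit]], k: int) -> List[Hit]:
    """Pick the global top-k from the (at most N*k) local candidates."""
    candidates = [hit for part in per_partition for hit in part]
    return heapq.nsmallest(k, candidates)


# Made-up candidate hits for three independent data partitions.
partitions = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.1, "b1"), (0.7, "b2")],
    [(0.3, "c1")],  # holds fewer than k records
]
k = 2

local_results = [local_top_k(p, k) for p in partitions]
result = global_top_k(local_results, k)
print(result)
```

Because the partitions share no data or static structure, the local steps are independent of each other and only the small merge needs a global view, which is the same shape as a local/global group-by.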
