kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3246817018

   Thank you for your input everyone!
   
   > I'm wondering if ACORN would work for this use case
   
   @dungba88 while ACORN may speed up the graph-search component of a 
pre-filtered search -- it still requires building a `BitSet` of accepted docs 
at query time + searching through a large graph containing _all_ documents. A 
positive consequence of this search is that a user can filter on _any_ 
arbitrary `Query`
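   
   To make the pre-filtered flow concrete, here is a rough sketch in plain Java. It is not Lucene code: the class `PreFilteredSearchSketch` and its methods are hypothetical, and a linear scan stands in for the graph traversal. It only illustrates the shape of the work: a `BitSet` of accepted docs is built first, every document is visited, and only accepted ones are collected.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; a linear scan replaces HNSW traversal.
public class PreFilteredSearchSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    public static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Visits every document (standing in for a walk over the full graph),
     * but only collects documents whose bit is set in acceptDocs.
     * Returns up to k doc ids, nearest first.
     */
    public static List<Integer> search(float[][] vectors, BitSet acceptDocs, float[] query, int k) {
        Comparator<Integer> byDistance =
            Comparator.comparingDouble((Integer doc) -> squaredDistance(vectors[doc], query));
        // Max-heap on distance: the worst of the current top-k sits on top.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byDistance.reversed());
        for (int doc = 0; doc < vectors.length; doc++) {
            if (!acceptDocs.get(doc)) {
                continue; // visited, but rejected by the filter
            }
            heap.add(doc);
            if (heap.size() > k) {
                heap.poll(); // drop the farthest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(byDistance);
        return result;
    }
}
```

   The point of the sketch is that the filter narrows what is collected, not what is stored or traversed.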
   
   The benefit I'm suggesting applies when a user wants to filter on a 
constraint known at index-time: we can move the bulk of the work to indexing, 
and create and search smaller graphs. The extra index size may be acceptable 
because HNSW graphs are smaller than the raw vector data
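   
   As a back-of-the-envelope check on the graph-versus-vector size claim, the sketch below compares one float32 vector with an upper bound on one node's level-0 neighbor list. The numbers are assumptions for illustration: 768 dimensions, `maxConn = 16`, up to `2 * maxConn` neighbors on level 0, and neighbors stored as plain 4-byte ints. Upper levels and any on-disk compression are ignored, and the class name is hypothetical.

```java
// Hypothetical arithmetic sketch; not a measurement of any real index.
public class GraphVersusVectorSize {

    /** Bytes needed for one float32 vector of the given dimension. */
    public static long rawVectorBytes(int dimension) {
        return (long) dimension * Float.BYTES;
    }

    /**
     * Upper bound on bytes for one node's level-0 neighbor list,
     * assuming up to 2 * maxConn neighbors stored as uncompressed ints.
     */
    public static long levelZeroNeighborBytes(int maxConn) {
        return 2L * maxConn * Integer.BYTES;
    }

    public static void main(String[] args) {
        int dimension = 768; // assumed embedding size
        int maxConn = 16;    // assumed HNSW max connections
        System.out.println("vector bytes per doc: " + rawVectorBytes(dimension));
        System.out.println("level-0 graph bytes per doc (upper bound): "
            + levelZeroNeighborBytes(maxConn));
    }
}
```

   Under these assumptions, a document that appears in several category graphs adds far fewer bytes per extra graph than its raw vector occupies.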
   
   In an e-commerce marketplace, this could mean having different graphs for 
(often overlapping) categories of products -- for example:
   - If a user first performs a broad search on the entire catalog, we can 
search for vector results without any filter
   - But if they "scope" into a category (like electronics, furniture, etc.) -- 
we need to create a `BitSet` of all products in the category, traverse a large 
HNSW graph, but only "collect" vectors matching the filter
   - The `BitSet` can be shared across queries, but a separate one needs to be 
created for each index checkpoint. We'll also need a separate `BitSet` for 
_each_ category -- which is stored on-heap and can become memory-intensive for 
a large number of unique categories (or filters in general)
   - Since these categories are known at index-time, I'm proposing to create 
separate HNSW graphs for _each_ category (or filter in general). These will be 
smaller, off-heap, and backed by the same set of raw vectors!
   - This _will_ increase indexing time and size, but with the benefit of lower 
RAM pressure and faster search (smaller graphs) -- and we can let the user make 
this tradeoff decision.
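   
   The per-category idea in the list above can be sketched roughly as follows. This is plain Java and not a proposed Lucene API: `PerCategoryIndexSketch` is a hypothetical name, an array of doc ordinals stands in for each category's HNSW graph, and a linear scan stands in for graph search. What it tries to show is that the raw vectors are stored once, categories may overlap, and a scoped search only touches its own category's structure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; ordinal arrays stand in for per-category graphs.
public class PerCategoryIndexSketch {

    // Raw vectors are stored once and shared by every category.
    private final float[][] vectors;
    // One small structure per category, built at "index time".
    private final Map<String, int[]> ordinalsByCategory = new HashMap<>();

    public PerCategoryIndexSketch(float[][] vectors) {
        this.vectors = vectors;
    }

    /** Registers a category; a document may belong to several categories. */
    public void addCategory(String category, int[] docOrdinals) {
        ordinalsByCategory.put(category, docOrdinals.clone());
    }

    /**
     * Returns the doc ordinal nearest to the query within the category,
     * or -1 if the category is unknown or empty.
     */
    public int nearest(String category, float[] query) {
        int[] ordinals = ordinalsByCategory.get(category);
        if (ordinals == null) {
            return -1;
        }
        int best = -1;
        float bestDistance = Float.POSITIVE_INFINITY;
        for (int ord : ordinals) {
            float[] v = vectors[ord];
            float distance = 0f;
            for (int i = 0; i < v.length; i++) {
                float d = v[i] - query[i];
                distance += d * d; // squared Euclidean distance
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = ord;
            }
        }
        return best;
    }
}
```

   No per-query `BitSet` is needed here, since membership was decided when the category structure was built.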
   
   > The use-case you have mentioned is pretty interesting.
   
   @navneet1v thanks, please let me know if OpenSearch has use-cases which will 
benefit from this proposal (e.g. due to multi-tenant nodes, or specific 
applications)
   
   @jpountz while I started out thinking of adding an explicit set of IDs for 
each vector, and then creating separate graphs for each ID -- I realize this 
approach will require non-trivial API changes to deal with IDs as a new 
concept. However, there may be an alternate way of doing the same thing (see 
following comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org
