kaivalnp commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3246817018
Thank you for your input everyone!

> I'm wondering if ACORN would work for this use case

@dungba88 while ACORN may speed up the graph-search component of a pre-filtered search -- it still requires building a `BitSet` of accepted docs at query time + searching through a large graph containing _all_ documents. A positive consequence of this search is that a user can filter on _any_ arbitrary `Query`

The benefit I'm suggesting is for when a user wants to filter on a constraint known at index-time, and consequently move the bulk of the work to indexing in order to create and search smaller graphs -- which may be acceptable because the size of HNSW graphs is smaller than raw vector data

In an e-commerce marketplace, this could mean having different graphs for (often overlapping) categories of products -- for example:

- If a user first performs a broad search on the entire catalog, we can search for vector results without any filter
- But if they "scope" into a category (like electronics, furniture, etc.) -- we need to create a `BitSet` of all products in the category, traverse a large HNSW graph, but only "collect" vectors matching the filter
- The `BitSet` can be shared across queries, but a separate one needs to be created for each index checkpoint. We'll also need a separate `BitSet` for _each_ category -- which is stored on-heap and can become memory-intensive for a large number of unique categories (or filters in general)
- Since these categories are known at index-time, I'm proposing to create separate HNSW graphs for _each_ category (or filter in general). These will be smaller, off-heap, and backed by the same set of raw vectors!
- This _will_ increase indexing time and size, but with the benefit of lower RAM pressure and faster search (smaller graphs) -- and we can let the user make this tradeoff decision.

> The use-case you have mentioned is pretty interesting.

@navneet1v thanks, please let me know if OpenSearch has use-cases which will benefit from this proposal (e.g. due to multi-tenant nodes, or specific applications)

@jpountz while I started out thinking of adding an explicit set of IDs for each vector, and then creating separate graphs for each ID -- I realize this approach will require non-trivial API changes to deal with IDs as a new concept. However, there may be an alternate way of doing the same thing (see following comment)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org