Oh, thanks for pointing that out; I hadn't seen the issue. I think
it's roughly the same idea. We had been discussing it off-line (Kaival
joined our office in Boston recently). Let's move the discussion to
that issue and iterate there.

On Thu, Jun 5, 2025 at 2:44 PM Michael Froh <msf...@gmail.com> wrote:
>
> I'm wondering if this is the same idea that Kaival is proposing in
> https://github.com/apache/lucene/issues/14758 (Support multiple HNSW graphs
> backed by the same vectors).
>
> On Thu, Jun 5, 2025 at 11:32 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> > I do think there could be many interesting use cases for building
> > multiple graphs from a single set of vectors.  For example, one might
> > want to sometimes search all the docs, sometimes search the one subset
> > and other times another subset; baking the constraint into the graph
> > construction would lead to more efficient searches than the other
> > graph search filtering we can do today (pre- and post-filtering) and
> > there could be use cases where the constraints are so very often
> > present that we would want to pay the up-front cost of computing
> > multiple graphs without paying the cost of storing the same vectors
> > multiple times in the index.  This isn't supported today but I think
> > would be a welcome contribution.
> >
> > On Wed, Jun 4, 2025 at 3:51 AM Ravikumar Govindarajan
> > <ravikumar.govindara...@gmail.com> wrote:
> > >
> > > >
> > > > I wonder if you could influence the graph search by incorporating the
> > > > partition key (customer id?) to the vectors somehow? If this was done
> > > > well it should lead to a natural clustering of the graph.
> > > >
> > >
> > > I can explore further on this. Thanks for the pointers..
> > >
> > > On Mon, Jun 2, 2025 at 11:14 PM Michael Sokolov <msoko...@gmail.com>
> > wrote:
> > >
> > > > I wonder if you could influence the graph search by incorporating the
> > > > partition key (customer id?) to the vectors somehow? If this was done
> > > > well it should lead to a natural clustering of the graph.
> > > >
> > > > On Mon, Jun 2, 2025 at 11:32 AM Ravikumar Govindarajan
> > > > <ravikumar.govindara...@gmail.com> wrote:
> > > > >
> > > > > Hi Michael,
> > > > >
> > > > > > The number of docs per range varies widely, from a few tens to
> > > > > > tens of thousands and, in very heavy usage cases, 100k and above,
> > > > > > all in a single segment.
> > > > > >
> > > > > > Filtered Hnsw, like you said, uses a single graph, which could be
> > > > > > better if designed as sub-graphs.
> > > > >
> > > > > On Mon, 2 Jun 2025 at 5:42 PM, Michael Sokolov <msoko...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > > How many documents do you anticipate in a typical sub range? If
> > > > > > > it's in the hundreds or even low thousands you would be better
> > > > > > > off without hnsw. Instead you can use a function score query
> > > > > > > based on the vector distance. For larger numbers where hnsw
> > > > > > > becomes useful, you could try using filtered hnsw, but this will
> > > > > > > be using a single graph constructed from all of the documents.
> > > > > >
> > > > > > On Mon, Jun 2, 2025, 5:25 AM Ravikumar Govindarajan <
> > > > > > ravikumar.govindara...@gmail.com> wrote:
> > > > > >
> > > > > > > > We use index-sorting to arrange segment data. The ord-ranges for
> > > > > > > > any given KnnVectorField are mutually exclusive.
> > > > > > >
> > > > > > > Ex:
> > > > > > > field: content
> > > > > > >
> > > > > > > OrdRange -> 0-100 (User1)
> > > > > > > OrdRange -> 101-300 (User2)
> > > > > > > and so on..
> > > > > > >
> > > > > > > > Each OrdRange has to be a self-contained Hnsw graph with all
> > > > > > > > neighbours strictly inside the given OrdRange: a sub-graph, to be
> > > > > > > > precise. The generated segment will contain a lot of these
> > > > > > > > sub-graphs, but without any neighbour links to each other at
> > > > > > > > Level-0. Level-1 and above can have cross-links, which should
> > > > > > > > be fine.
> > > > > > >
> > > > > > > > Searches will be based on OrdRange and should stop once the
> > > > > > > > sub-graph is fully explored, without crossing over to other
> > > > > > > > sub-graphs.
> > > > > > >
> > > > > > > > I can index them as different fields, but the number of fields
> > > > > > > > could run into a few hundreds (if not thousands).
> > > > > > >
> > > > > > > > Are there any strategies I can adopt to accomplish this? Can a
> > > > > > > > custom VectorScoringFunction solve this? (For example: assign the
> > > > > > > > actual score if ords are in range, and assign 0 if out of range.)
> > > > > > >
> > > > > > > Is this the correct way of looking at the problem?
> > > > > > >
> > > > > > > Any help is much appreciated
> > > > > > >
> > > > > > > Regards,
> > > > > > > Ravi
> > > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >