On 23/08/2024 20:11, Robert Haas wrote:
I find it a little weird that hnsw thinks itself able to return all
the tuples in an order the user chooses, but unable to return all of
the tuples in an arbitrary order.
HNSW is weird in many ways:
- There is no inherent sort order. It cannot do "ORDER BY column", only
kNN-sort like "ORDER BY column <-> value".
- It's approximate. It's not guaranteed to return the same set of rows
as a sequential scan + sort.
- The number of results it returns is limited by the hnsw.ef_search GUC,
default 100.
- It collects all the results (up to hnsw.ef_search) in memory, and only
then returns them. So if you tried to use it with a large number of
results, it can simply run out of memory.
Arguably all of those are bugs in HNSW, but it is what it is. The
algorithm is inherently approximate. Despite that, it's useful in practice.
In core, we have precedent for index
types that can't return individual tuples at all (gin, brin) but not
one that is able to return tuples in concept but has a panic attack if
you don't know how you want them sorted.
Well, we do also have gin_fuzzy_search_limit. Two wrongs doesn't make it
right, though; I'd love to get rid of that hack too somehow.
I don't quite see why you
couldn't just treat that case the same as ORDER BY
the_first_column_of_the_index, or any other arbitrary rule that you
want to make up. Sure, it might be more expensive than a sequential
scan, but the user said they didn't want a sequential scan. I'm not
quite sure why pgvector thinks it gets to decide that it knows better
than the user, or the rest of the optimizer. I don't even think I
really believe it would always be worse: I've seen cases where a table
was badly bloated and mostly empty but its indexes were not bloated,
and in that case an index scan can be a HUGE winner even though it
would normally be a lot worse than a sequential scan.
Sure, you could make it work. It could construct a vector out of thin
air to compare with, when there's no scan key, or implement a completely
different codepath that traverses the full graph in no particular order.
If you don't want to fix hnsw to work the way the core optimizer
thinks it should, or if there's some reason it can't be done,
alternatives might include (1) having the cost estimate function hack
the count of disabled nodes and (2) adding some kind of core support
for an index cost estimator refusing a path entirely. I haven't tested
(1) so I don't know for sure that there are no issues, but I think we
have to do all of our cost estimating before we can think about adding
the path so I feel like there's a decent chance it would do what you
want.
It would seem useful for an index AM to be able to say "nope, I can't do
this". I don't remember how exactly this stuff works, but I'm surprised
it doesn't already exist.
--
Heikki Linnakangas
Neon (https://neon.tech)