Re: On disable_cost

Jonathan S. Katz Fri, 23 Aug 2024 11:20:21 -0700

On 8/23/24 1:11 PM, Robert Haas wrote:

On Fri, Aug 23, 2024 at 11:17 AM Jonathan S. Katz <jk...@postgresql.org> wrote:

We hit an issue with pgvector[0] where a regular `SELECT count(*) FROM
table`[1] is attempting to scan the index on the vector column when
`enable_seqscan` is disabled. Credit to Andrew Kane (CC'd) for flagging it.


I was able to trace this back to e2225346. Here is a reproducer:


If I change EXPLAIN ANALYZE in this test to just EXPLAIN, I get this:

  Aggregate  
(cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00
rows=1 width=8)
    ->  Index Only Scan using test_embedding_idx on test
(cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00
rows=5 width=0)

It took me a moment to wrap my head around this: the cost estimate is
312 decimal digits long. Apparently hnswcostestimate() just returns
DBL_MAX when there are no scan keys because it really, really doesn't
want to do that. Before e2225346, that kept this plan from being
generated because it was (much) larger than disable_cost. But now it
doesn't, because 1 disabled node makes a path more expensive than any
possible non-disabled path. Since that was the whole point of the
patch, I don't feel too bad about it.

I find it a little weird that hnsw thinks itself able to return all
the tuples in an order the user chooses, but unable to return all of

the tuples in an arbitrary order.

For HNSW, "order" is approximated - even when it's returning "in the order the user chooses," the scan is making the best guess at what the correct order is based on the index structure. At the traditional "leaf" level of an index, you're actually traversing a graph-based neighborhood of values. And maybe we could say "Hey, if you get the equivalent of a count(*), just do the count at the bottom layer (Layer 0)" but I think this would be very expensive.

In core, we have precedent for index
types that can't return individual tuples at all (gin, brin) but not
one that is able to return tuples in concept but has a panic attack if
you don't know how you want them sorted. I don't quite see why you
couldn't just treat that case the same as ORDER BY
the_first_column_of_the_index, or any other arbitrary rule that you
want to make up. Sure, it might be more expensive than a sequential
scan, but the user said they didn't want a sequential scan. I'm not
quite sure why pgvector thinks it gets to decide that it knows better
than the user, or the rest of the optimizer. I don't even think I
really believe it would always be worse: I've seen cases where a table
was badly bloated and mostly empty but its indexes were not bloated,
and in that case an index scan can be a HUGE winner even though it
would normally be a lot worse than a sequential scan.

The challenge here is that HNSW is used specifically for approximating ordering; it's not used to directly filter results in the traditional sense (e.g. via. a WHERE clause). It's a bit different than the others mentioned in that regard. However, maybe there are other options to consider here based on this work.

If you don't want to fix hnsw to work the way the core optimizer
thinks it should, or if there's some reason it can't be done,
alternatives might include (1) having the cost estimate function hack
the count of disabled nodes and (2) adding some kind of core support
for an index cost estimator refusing a path entirely. I haven't tested
(1) so I don't know for sure that there are no issues, but I think we
have to do all of our cost estimating before we can think about adding
the path so I feel like there's a decent chance it would do what you
want.


Thanks for the options.

Also, while I did take the initiative to download pgvector and compile
it and hook up a debugger and figure out what was going on here, I'm
not really too sure that's my job. I do think I have a responsibility
to help maintainers of out-of-core extensions who have problems as a
result of my commits, but I also think it's fair to hope that those
maintainers will try to minimize the amount of time that I need to
spend trying to read code that I did not write and do not maintain.
Fortunately, this wasn't hard to figure out, but in a way that's kind
of the point. That DBL_MAX hack was put there by somebody who must've
understood that they were trying to use a very large cost to disable a
certain path shape completely, and it seems to me that if that person
had studied this case and the commit message for e2225346, they would
have likely understood what had happened pretty quickly. Do you think
that's an unfair feeling on my part?

I don't think extension maintainers necessarily have the same level of PostgreSQL internals as you or many of the other people who frequent -hackers, so I think it's fair for them to ask questions or raise issues with patches they don't understand. I was able to glean from the commit message that this was the commit that likely changed the behavior in pgvector, but I can't immediately glean looking through the code as to why. (And using your logic, should an extension maintainer understand the optimizer code when PostgreSQL is providing an interface to the extension maintainer to encapsulate its interactions)?

You can always push back and say "Well, maybe try this, or try that" - which would be a mentoring approach that could push it back on the extension maintainer, which is valid, but I don't see why an extension maintainer can't raise an issue or ask a question here.


Thanks,

Jonathan

OpenPGP_signature.asc
Description: OpenPGP digital signature

Re: On disable_cost

Reply via email to