Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1?  That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice.  That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same as well at first. Note that, as mentioned, I did measure a 
degradation of 2-3% in some cases, but initially I also expected worse. Do you 
have any ideas on which cases would suffer the most? I thought the tight inner 
nested loop that I posted in my performance tests would have this index lookup 
as the bottleneck. I know it is the bottleneck for the LIMIT 1 queries (because 
those improve by a factor of 2-3 with the patch). My theory is that the price 
paid for this optimization is highest for a LIMIT 3, because that would touch 
the page twice and read all items from it, while only returning three of them.
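
To make that concrete, the shape I have in mind is roughly the following 
(table and index names are made up for illustration, not taken from my actual 
test set):

-- Hypothetical worst case: an inner index scan that needs three tuples.
-- With the patch, _bt_first only reads the first tuple(s) from the leaf
-- page, so the LIMIT 3 makes _bt_next lock the same page a second time
-- and read the remaining items - two page visits for little gain.
CREATE TABLE measurements (
    item_id     int,
    measured_at timestamptz,
    value       double precision
);
CREATE INDEX ON measurements (item_id, measured_at);

SELECT *
FROM measurements
WHERE item_id = 42
ORDER BY measured_at DESC
LIMIT 3;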

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting.  But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well, and I would be willing to implement such a 
thing if the community deems it interesting enough. However, I didn't want to 
do this for the first version of this patch, as it would be quite a bit of 
extra work, which would be wasted if the idea of the patch itself gets rejected 
anyway. :-) I'd appreciate any pointers in the right direction - I can take a 
look at how top-N sorting pushes the LIMIT down. Given enough interest in the 
basic idea of this patch, I will implement it.
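
For reference, the existing behavior I mean is visible in a plan like the 
following (same hypothetical table as in the example above):

-- With a LIMIT on top of a Sort, the executor already passes the bound
-- down to the Sort node, and EXPLAIN ANALYZE then reports
-- "Sort Method: top-N heapsort" instead of a full sort.
EXPLAIN ANALYZE
SELECT *
FROM measurements
ORDER BY value
LIMIT 10;

If I'm reading the code correctly this happens via ExecSetTupleBound in 
nodeLimit.c, so presumably a similar "fetch at most N tuples" hint could be 
handed through the AM interface down to the index scan.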

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, 
>> descend the tree to a leaf page and read just the first (or possibly two) 
>> tuples. It won't touch the rest of the page yet. If indeed just one tuple 
>> was required, there won't be a call to _bt_next and we're done. If we do 
>> need more than one tuple, _bt_next will resume reading tuples from the index 
>> page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch doesn't try to 
estimate it and simply assumes the good-weather scenario in which everything is 
visible. I'm not sure whether it's possible to estimate this, or whether such 
an estimate would be useful. Currently, if it turns out that the tuple is not 
visible, there will just be another call to _bt_next, which will resume reading 
the page as normal. I'm open to implementing any suggestions that may improve 
this.
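
To illustrate that bad-weather case with the same hypothetical table:

-- After a DELETE (and before a vacuum), the index still contains entries
-- pointing to dead tuples, so the scan may have to fetch several index
-- entries before it finds one that is visible to the query.
DELETE FROM measurements
WHERE item_id = 42
  AND measured_at > now() - interval '1 day';

SELECT *
FROM measurements
WHERE item_id = 42
ORDER BY measured_at DESC
LIMIT 1;

In that case the optimization doesn't buy anything and we simply fall back to 
reading the rest of the page in _bt_next.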

>> - We need to take into account page splits, insertions and vacuums while we 
>> do not have the read-lock in between _bt_first and the first call to 
>> _bt_next. This made my patch quite a bit more complicated than my initial 
>> implementation.

> Meh.  I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller.  There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here 
already: to get input from the experts on this topic about which assumptions 
can be made now and in the future. If it turns out that an optimization like 
this is not feasible at all because of other constraints in the btree 
implementation, then we're done pretty quickly here. :-) For what it's worth: 
the patch at least passes make check consistently - I caught a lot of these 
edge cases related to page splits and insertions while running the regression 
tests, which exercise the modified bits of code quite often and in parallel. 
There may be plenty of edge cases left, however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. 
Cases with a LIMIT seem common enough to me to justify some extra optimization, 
especially if that could lead to 2-3x the TPS for these kinds of queries. 
However, it indeed needs to stay within reasonable complexity; if optimizing 
this requires adding a lot of extra complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is pretty difficult to come up with a faster plan than this, unfortunately. 
We have a large database with many tables of timeseries data, and once a table 
gets large, has an efficient multi-column index, and needs these kinds of 
time-based or item-based lookups, the nested loop is generally the fastest 
option. I can elaborate more on this, but I'm not sure this thread is the best 
place for that.
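
As a simplified, hypothetical version of the pattern (items and measurements 
are again made-up names):

-- Fetch the latest measurement for each item in a set. This typically
-- comes out as a nested loop with, on the inner side, an index scan on
-- (item_id, measured_at) plus a LIMIT 1 - exactly the inner lookup that
-- the patch speeds up.
SELECT i.item_id, latest.*
FROM items i
CROSS JOIN LATERAL (
    SELECT m.measured_at, m.value
    FROM measurements m
    WHERE m.item_id = i.item_id
    ORDER BY m.measured_at DESC
    LIMIT 1
) latest;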

I appreciate you taking the time to look at this. I'd be happy to look into any 
suggestions you come up with. Working on this has taught me a lot about the 
internals of Postgres already - I find it really interesting!

-Floris

