Mon, Mar 5, 2018 at 19:48, Ants Aasma <ants.aa...@eesti.ee>:

> On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
> <m...@komzpa.net> wrote:
> >> This approach mixes well with hash
> >> partitioning. It would be neat indeed if PostgreSQL do something
> >> equivalent on its own, and pluggable storage work being done could
> >> enable index organized tables that would help. But you probably need
> >> something right now.
> >
> >
> > Fixing glaring issues (no vacuum and thus no Index-Only Scan on
> > append-only tables, vacuum processing all of the eternity of btree) by 11
> > will get most of the spike-nails out of the microservice code, and we can
> > probably live with them until 11 gets to RDS.
> >
> > I also don't see why a pluggable storage is a must for the clustered
> > write. Postgres does have a mechanism for selecting the next page to
> > write a tuple to; right now it just looks at the FSM - but what if it
> > peeked at an existing index that already has enough data to route the
> > tuple to the correct page on write?
> The mechanism you outlined would likely work for your use case, but it
> has many issues that prevent it from being universally useful. From
> the top of my head:
> * One extra index descent per insertion (I/O for this is necessary
> anyway, but CPU work is duplicated).

This part is going to be needed for any version of such mechanisms.
It is only going to be enabled for CLUSTERed tables, so most people
won't be affected.

> * We don't currently track the amount of bloat. A mechanism that does
> this needs to be added.

We've got the number of tuples and the number of pages in a relation. We
know we're degraded when the number of tuples approaches the number of
pages. That seems to be enough for a good-enough cutoff heuristic.

> * If table hits the bloat limit there will be a sudden change in
> behavior. This is pretty nasty from an operations point of view.

We can omit the bloat limit for a PoC implementation and see if that is
enough. It may turn out that there are other ways to calculate sibling
pages that handle both ascending and descending insertion.

> * With your (id,ts) clustering and data coming in mostly ordered by
> timestamp, after initial warmup, each page will contain rows from a
> single id, but different ids are arbitrarily interleaved. This is
> better than current state, but people might want to have an
> interleaving step bigger than 8kB to better utilize storage hardware.

8kB is what a btree index offers in clustered fashion, and people are okay
with that, aren't they?

When thinking about this I asked: how can we make the current heap
behave like a perfectly-clustered btree, given that we've got a
perfectly-clustered btree nearby but cannot use a split operation, since
we cannot move a tuple between pages? This was the best I could think of,
for now. Do you have a better idea? :)

> * It seems that with a common (ts) clustering and age of timestamp
> coming from an exponential distribution, this will quickly bloat to
> threshold and then insert data in a rather arbitrary order. This is
> much worse than the default behavior.

Can you please explain how that is going to happen? I expect such a
process to quickly start reusing previous pages rather than reach the
threshold, as a page with enough free space to hold the newly arriving
tuple will be among the pages obtained from the index. Clustering will
likely not be perfect, but it will reorder at least part of the tuples.

Please note that every tuple placed into the heap via the FSM is going to
drag the tuples following it onto that page, until the limiting mechanism
kicks in because the page runs out of space.

> Providing an out-of-the-box solution in core PostgreSQL would of
> course be best, but

Let's stop right here. An out-of-the-box solution in core would be best,
period. I'll happily discuss out-of-postgres solutions on out-of-postgres
lists :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk
