On 05/11/14 13:45, Heikki Linnakangas wrote:
On 11/04/2014 11:01 PM, Petr Jelinek wrote:
On 04/11/14 13:11, Heikki Linnakangas wrote:
On 10/13/2014 01:01 PM, Petr Jelinek wrote:
Only the alloc and reloptions methods are required (and implemented by
the local AM).

The caching, xlog writing, updating the page, etc is handled by
backend,
the AM does not see the tuple at all. I decided to not pass even the
struct around and just pass the relevant options because I think if we
want to abstract the storage properly then the AM should not care about
how the pg_sequence looks like at all, even if it means that the
sequence_alloc parameter list is bit long.

Hmm. The division of labour between the seqam and commands/sequence.c
still feels a bit funny. sequence.c keeps track of how many values have
been WAL-logged, and thus usable immediately, but we still call
sequence_alloc even when using up those already WAL-logged values. If
you think of using this for something like a centralized sequence server
in a replication cluster, you certainly don't want to make a call to the
remote server for every value - you'll want to cache them.

With the "local" seqam, there are two levels of caching. Each backend
caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
addition to that, the server WAL-logs 32 values at a time. If you have a
remote seqam, it would most likely add a third cache, but it would
interact in strange ways with the second cache.

Considering a non-local seqam, the locking is also a bit strange. The
server keeps the sequence page locked throughout nextval(). But if the
actual state of the sequence is maintained elsewhere, there's no need to
serialize the calls to the remote allocator, i.e. the sequence_alloc()
calls.

I'm not exactly sure what to do about that. One option is to completely
move the maintenance of the "current" value, i.e. sequence.last_value,
to the seqam. That makes sense from an abstraction point of view. For
example with a remote server managing the sequence, storing the
"current" value in the local catalog table makes no sense as it's always
going to be out-of-date. The local seqam would store it as part of the
am-private data. However, you would need to move the responsibility of
locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
provide an API that the seqam can call to do that. Perhaps just let the
seqam call heap_inplace_update on the sequence relation.

My idea of how this works is - sequence_next handles the allocation and
the level2 caching (WAL logged cache) via amdata if it supports it, or
returns single value if it doesn't - then the WAL will always just write
the 1 value and there will basically be no level2 cache, since it is the
sequence_next who controls how much will be WAL-logged, what backend
asks for is just "suggestion".

Hmm, so the AM might return an "nallocated" value less than the "fetch"
value that sequence.c requested? As the patch stands, wouldn't that make
sequence.c write a WAL record more often?


That's correct, that's also why you usually want to have some form of local caching if possible.

In fact, if the seqam manages the current value outside the database
(e.g. a "remote" seqam that gets the value from another server),
nextval() never needs to write a WAL record.

Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in crash-safe way so WAL is necessary.

I think sequences will cache in amdata if possible for that type of sequence and in case where it's not and it's caching on some external server then you'll most likely get bigger overhead from network than WAL anyway...


For the amdata handling (which is the AM's private data variable) the
API assumes that (Datum) 0 is NULL, this seems to work well for
reloptions so should work here also and it simplifies things a little
compared to passing pointers to pointers around and making sure
everything is allocated, etc.

Sadly the fact that amdata is not fixed size and can be NULL made the
page updates of the sequence relation quite more complex that it
used to
be.

It would be nice if the seqam could define exactly the columns it needs,
with any datatypes. There would be a set of common attributes:
sequence_name, start_value, cache_value, increment_by, max_value,
min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
and "is_called" to that. A remote seqam that calls out to some other
server might store the remote server's hostname etc.

There could be a seqam function that returns a TupleDesc with the
required columns, for example.

Wouldn't that somewhat bloat catalog if we had new catalog table for
each sequence AM?

No, that's not what I meant. The number of catalog tables would be the
same as today. Sequences look much like any other relation, with entries
in pg_attribute catalog table for all the attributes for each sequence.
Currently, all sequences have the same set of attributes, sequence_name,
last_value and so forth. What I'm proposing is that there would a set of
attributes that are common to all sequences, but in addition to that
there could be any number of AM-specific attributes.

Oh, that's interesting idea, so the AM interfaces would basically return updated tuple and there would be some description function that returns tupledesc. I am bit worried that this would kill any possibility of ALTER SEQUENCE USING access_method. Plus I don't think it actually solves any real problem - serializing the internal C structs into bytea is not any harder than serializing them into tuple IMHO.


It also does not really solve the amdata being dynamic
size "issue".

Yes it would. There would not be a single amdata attribute, but the AM
could specify any number of custom attributes, which could be fixed size
or varlen. It would be solely the AM's responsibility to set the values
of those attributes.


That's not the issue I was referring to, I was talking about the page replacement code which is not as simple now that we have potentially dynamic size tuple and if tuples were different for different AMs the code would still have to be able to handle that case. Setting the values in tuple itself is not too complicated.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to