On 11/04/2014 11:01 PM, Petr Jelinek wrote:
On 04/11/14 13:11, Heikki Linnakangas wrote:
On 10/13/2014 01:01 PM, Petr Jelinek wrote:
Only the alloc and reloptions methods are required (and implemented by
the local AM).
The caching, xlog writing, updating the page, etc is handled by backend,
the AM does not see the tuple at all. I decided to not pass even the
struct around and just pass the relevant options because I think if we
want to abstract the storage properly then the AM should not care about
how the pg_sequence looks like at all, even if it means that the
sequence_alloc parameter list is bit long.
Hmm. The division of labour between the seqam and commands/sequence.c
still feels a bit funny. sequence.c keeps track of how many values have
been WAL-logged, and thus usable immediately, but we still call
sequence_alloc even when using up those already WAL-logged values. If
you think of using this for something like a centralized sequence server
in a replication cluster, you certainly don't want to make a call to the
remote server for every value - you'll want to cache them.
With the "local" seqam, there are two levels of caching. Each backend
caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
addition to that, the server WAL-logs 32 values at a time. If you have a
remote seqam, it would most likely add a third cache, but it would
interact in strange ways with the second cache.
Considering a non-local seqam, the locking is also a bit strange. The
server keeps the sequence page locked throughout nextval(). But if the
actual state of the sequence is maintained elsewhere, there's no need to
serialize the calls to the remote allocator, i.e. the sequence_alloc()
calls.
I'm not exactly sure what to do about that. One option is to completely
move the maintenance of the "current" value, i.e. sequence.last_value,
to the seqam. That makes sense from an abstraction point of view. For
example with a remote server managing the sequence, storing the
"current" value in the local catalog table makes no sense as it's always
going to be out-of-date. The local seqam would store it as part of the
am-private data. However, you would need to move the responsibility of
locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
provide an API that the seqam can call to do that. Perhaps just let the
seqam call heap_inplace_update on the sequence relation.
My idea of how this works is - sequence_next handles the allocation and
the level2 caching (WAL logged cache) via amdata if it supports it, or
returns single value if it doesn't - then the WAL will always just write
the 1 value and there will basically be no level2 cache, since it is the
sequence_next who controls how much will be WAL-logged, what backend
asks for is just "suggestion".
Hmm, so the AM might return an "nallocated" value less than the "fetch"
value that sequence.c requested? As the patch stands, wouldn't that make
sequence.c write a WAL record more often?
In fact, if the seqam manages the current value outside the database
(e.g. a "remote" seqam that gets the value from another server),
nextval() never needs to write a WAL record.
The third level caching as you say should be solved by the
sequence_request_update and sequence_update callback - that will enable
the sequence AM to handle this kind of caching asynchronously without
blocking the sequence_next unnecessarily.
That way it's possible to implement many different strategies, from no
cache at all, to three levels of caching, with and without blocking of
calls to sequence_next.
It's nice that asynchronous operation is possible, but that seems like a
more advanced feature. Surely an approach where you fetch a bunch of
values when needed is going to be more common. Let's focus on the simple
synchronous case first, and make sure the API works well for that.
For the amdata handling (which is the AM's private data variable) the
API assumes that (Datum) 0 is NULL, this seems to work well for
reloptions so should work here also and it simplifies things a little
compared to passing pointers to pointers around and making sure
everything is allocated, etc.
Sadly the fact that amdata is not fixed size and can be NULL made the
page updates of the sequence relation quite more complex that it used to
be.
It would be nice if the seqam could define exactly the columns it needs,
with any datatypes. There would be a set of common attributes:
sequence_name, start_value, cache_value, increment_by, max_value,
min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
and "is_called" to that. A remote seqam that calls out to some other
server might store the remote server's hostname etc.
There could be a seqam function that returns a TupleDesc with the
required columns, for example.
Wouldn't that somewhat bloat catalog if we had new catalog table for
each sequence AM?
No, that's not what I meant. The number of catalog tables would be the
same as today. Sequences look much like any other relation, with entries
in pg_attribute catalog table for all the attributes for each sequence.
Currently, all sequences have the same set of attributes, sequence_name,
last_value and so forth. What I'm proposing is that there would a set of
attributes that are common to all sequences, but in addition to that
there could be any number of AM-specific attributes.
It also does not really solve the amdata being dynamic
size "issue".
Yes it would. There would not be a single amdata attribute, but the AM
could specify any number of custom attributes, which could be fixed size
or varlen. It would be solely the AM's responsibility to set the values
of those attributes.
I think just dynamic field where AM stores whatever it
wants is enough (ie the current solution), it's just that the hackery
that sequences storage implementation does is more visible once there
are non-fixed fields.
Using a single amdata field is like creating a table with a single text
column, and putting all your data in the single column separated by
tabs. I'm sure you can make it work, but normalized data is much nicer
to work with.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers