Re: [HACKERS] IPv6 link-local addresses and inet data type

2016-06-28 Thread Markus Wanner
Haribabu,

On 07.06.2016 07:19, Haribabu Kommi wrote:
>> I have not looked at the spec, but I wouldn't be surprised if there
>> were an upper limit on the length of valid scope names.

Yeah, I didn't find any upper limit, either.

> I am not able to find any link that suggests the maximum length of the
> scope id.
> From [1], I came to know how the scope id is formed in different
> operating systems.
> 
> windows - fe80::3%1
> unix systems - fe80::3%eth0

eth0 may well be interface number 1 on some machine, making the two
spellings equivalent there. However, as discussed before, Postgres
cannot know that, so it shouldn't bother trying.
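
For illustration, under the no-interpretation semantics I'm advocating
(a sketch, not necessarily what the current patch implements), the two
spellings would compare as opaque strings:

  SELECT 'fe80::3%1'::inet = 'fe80::3%eth0'::inet;
  -- false, even if eth0 happens to be interface 1

  SELECT 'fe80::3%eth0'::inet = 'fe80::3%eth0'::inet;
  -- true, the zone ids match byte-wise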

> I added another character array of 256 members into inet_struct as the
> last member, to store the zone id.

I haven't looked at the patch in detail, but zeroing or memcpy'ing those
256 bytes seems like overkill to me. I'd recommend limiting this to
allocating and moving around only as many bytes as the scope id actually
needs.
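
To sketch what I have in mind (hypothetical layout; the first three
fields match the existing inet_struct in utils/inet.h, the zone fields
are made up):

  typedef struct
  {
      unsigned char family;     /* PGSQL_AF_INET or PGSQL_AF_INET6 */
      unsigned char bits;       /* number of bits in netmask */
      unsigned char ipaddr[16]; /* up to 128 bits of address */
      unsigned char zone_len;   /* bytes actually used by the zone id */
      char          zone[FLEXIBLE_ARRAY_MEMBER]; /* not NUL-terminated */
  } inet_struct;

That way a plain 'fe80::1' doesn't pay for 256 mostly unused bytes.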

> Currently all printable characters are treated as zone ids. I will
> restrict this to only letters and numbers.

I fear alphanumeric only is too restrictive. RFC 4007 only specifies
that the zone id "must not conflict with the delimiter character" and
leaves everything beyond that to the implementation (which seems too
loose; allowing all printable characters sounds about right to me...).

> I will add the zone support for everything and send the patch.

What's currently missing?

> How about the following case, Do we treat them as same or different?
> 
> select 'fe80::%eth1'::inet = 'fe80::%ETH1'::inet;

Let's be consistent in not interpreting the scope id in any way, meaning
those would be different. (After all, interface names seem to be case
sensitive - on my variant of Linux, at the very least - i.e. ETH1 cannot
be found, while eth1 can be.)

> fe80::%2/64 is treated as a valid address, but not the other way
> around, as fe80::/64%2.
> Do we need to throw an error in this case, or just ignore it?

I didn't find any evidence for the second case being invalid; nor for it
being valid.

Note, however, that RFC 4007 only gives recommendations for textual
representations (with a "should" preference for the former).

It explicitly "does not specify how the format for non-global addresses
should be combined with the preferred format for literal IPv6 addresses".

Also note that RFC 2732 (Format for Literal IPv6 Addresses in URL's)
doesn't have the '%' sign in the set of reserved characters (not even in
the "unwise" one).

I'm starting to question if it's really wise to add the scope id to the
INET6 type...

> [2] - 
> http://stackoverflow.com/questions/24932172/what-length-can-a-network-interface-name-have

Note that the scope id doesn't necessarily have to be a network
interface name. Concluding there's a maximum of 256 bytes just because
that's the maximum for a network interface name doesn't seem correct.
However, I agree that's a reasonable limit for a scope id of the inet6
data type.

Kind Regards

Markus Wanner





Re: [HACKERS] IPv6 link-local addresses and inet data type

2016-06-03 Thread Markus Wanner
On 06/03/2016 06:55 PM, Tom Lane wrote:
> More importantly,
> on what basis do you conclude that the inet type will only be asked to
> store link-local addresses that are currently valid on the local machine?
> It is not very hard to think of applications where that wouldn't be the
> case.

That's a good point.

> I think a better plan is just to store the zone string verbatim.  It is
> not our job to check its validity or try to determine which spellings
> are equivalent.

Agreed.

That leaves me wondering if it's really worth extending INET, though.
TEXT would be just fine to store a textual scope id. And it makes it
utterly clear that there's no magic involved.
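
A minimal sketch of that alternative (table and column names are made
up, of course):

  CREATE TABLE host (
      addr inet,  -- the address proper, without any scope
      zone text   -- opaque scope id; NULL for global addresses
  );

  INSERT INTO host VALUES ('fe80::1', 'eth0');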

Kind Regards

Markus Wanner





Re: [HACKERS] IPv6 link-local addresses and inet data type

2016-06-03 Thread Markus Wanner
On 06/03/2016 12:14 AM, Tom Lane wrote:
> Markus Wanner  writes:
>> I'm even wondering if 'fe80::1%1'::inet = 'fe80::1%2'::inet shouldn't
>> simply yield true. After all, it's the same (non-global) address.
> 
> Surely not?  If the zone_ids didn't mean anything, why would the concept
> even exist?  ISTM that what you've got there is two different addresses
> neither of which is reachable from outside the given machine --- but
> within that machine, they are different.

You are right, both of these addresses actually are considered different
from 'fe80::1', i.e. I cannot ping6 fe80::1, but need to explicitly
specify an interface - either via an additional argument or with the
percent notation.

That's interesting news to me, as this means that only global IPv6
addresses can be stored in 128 bits. For non-global ones that's not
sufficient, and 'fe80::1' isn't even considered a valid address by
itself. Checking the headers and manpages, sockaddr_in6 sports a
uint32_t sin6_scope_id (per RFC 2553).
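
For reference, that's this structure (as <netinet/in.h> defines it,
following the RFC):

  struct sockaddr_in6 {
      sa_family_t     sin6_family;   /* AF_INET6 */
      in_port_t       sin6_port;     /* transport layer port */
      uint32_t        sin6_flowinfo; /* IPv6 flow information */
      struct in6_addr sin6_addr;     /* the 128-bit address itself */
      uint32_t        sin6_scope_id; /* set of interfaces for a scope */
  };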

Then, there also is the embedded form which uses the 2nd 16-bit word of
the IPv6 address to store the scope id, see for example "8.1.1.3 Scope
Index" in [0]. However, that document also clearly states: "When you
specify scoped address to the command line, NEVER write the embedded
form (such as ff02:1::1 or fe80:2::fedc)".

Given that interfaces are addressed by index internally, I'm pretty sure
the two representations are equivalent (i.e. if eth3 currently happens
to be the 7th interface, then 'fe80::1%eth3' resolves to 'fe80::1%7').

Considering that Postgres is quite likely to write INET values to
permanent storage, those values had better survive a reboot. And while
I have some doubts about the persistence of interface names, they
clearly have a higher chance of surviving a reboot than interface
indices. Therefore, I'd advocate resolving interface indices (if given)
to interface names using if_indextoname(3) and letting INET store only
names.
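
A minimal sketch of that resolution step (function name is made up,
error handling reduced to a return code):

  #include <net/if.h>

  /* Map a numeric scope id (e.g. the 7 in 'fe80::1%7') to the
   * corresponding interface name (e.g. "eth3"), so that only the
   * name gets stored. */
  static int
  zone_index_to_name(unsigned int scope_index, char zone[IF_NAMESIZE])
  {
      if (if_indextoname(scope_index, zone) == NULL)
          return -1;  /* no interface with that index on this host */
      return 0;
  }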

Assuming such an implementation, the following would hold (on the
example system mentioned above):

  'fe80::1%1'::inet != 'fe80::1%2'::inet

but:

  'fe80::1%7'::inet = 'fe80::1%eth3'::inet

I'd also tend to deny scope ids for global addresses (as ping6 does),
but allow non-global addresses ('fe80::1') to be stored without an
interface name, as pre-existing applications may rely on that working
(which ping6 does not allow).

Does that sound like a more reasonable plan?

Kind Regards

Markus Wanner


[0]: FreeBSD dev handbook, Section 8.1.1.3: Scope Index:
https://www.freebsd.org/doc/en/books/developers-handbook/ipv6.html





Re: [HACKERS] IPv6 link-local addresses and inet data type

2016-06-02 Thread Markus Wanner
On 31.05.2016 12:40, Andreas Karlsson wrote:
> On 05/31/2016 04:06 AM, Tom Lane wrote:
>> Unless there's a semantic difference between fe80::1%2/64 and
>> fe80::1/64%2, this doesn't seem like a big deal to me.
> 
> As far as I can tell only fe80::1%2/64 is valid, but I am not 100% sure.

RFC 4007, Section 11.7, states: "In this combination, it is important
to place the zone index portion before the prefix length when we
consider parsing the format by a name-to-address library function
[11].  That is, we can first separate the address with the zone index
from the prefix length, and just pass the former to the library
function."

However, in the sense of being liberal in what you accept, 'fe80::/64%2'
should probably work as well.
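
In psql terms, that liberal variant would look something like this
(hypothetical, assuming input gets normalized to the recommended form):

  SELECT 'fe80::1%2/64'::inet;  -- the order RFC 4007 recommends
  SELECT 'fe80::1/64%2'::inet;  -- accepted too, normalized to the above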

Given that a zone_id is a) highly system-dependent and b) only ever
meaningful for non-global addresses, I'm wondering what the use case
for storing one is.

I'm even wondering if 'fe80::1%1'::inet = 'fe80::1%2'::inet shouldn't
simply yield true. After all, it's the same (non-global) address.

Kind Regards

Markus Wanner





Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-23 Thread Markus Wanner
On 07/16/2013 09:14 PM, I wrote:
> But okay, you're saying we *have* and *want* a guarantee that even a
> superuser cannot execute arbitrary native code via libpq (at least in
> default installs w/o extensions).

I stand corrected and have to change my position, again. For the record:

We do not have such a guarantee. Nor does it seem reasonable to want
one. On a default install, it is entirely possible for the superuser to
run arbitrary code via libpq alone.

There are various ways to do it, but the simplest one I was shown is:
 - upload a DSO from the client into a large object
 - SELECT lo_export() that LO to a file on the server
 - LOAD it
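
Spelled out as a psql session, the path looks roughly like this (the
paths and the OID are illustrative, of course):

  \lo_import '/path/on/client/evil.so'      -- client-side upload
  SELECT lo_export(16404, '/tmp/evil.so');  -- writes to the *server's* filesystem
  LOAD '/tmp/evil.so';                      -- loads the DSO into the backend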

There are a couple of other options, so even if we made LOAD perform
permission checks (as I proposed earlier in this thread), the superuser
could still fiddle with function definitions - to the point that it
doesn't seem reasonable to try to protect against it.

Thus, the argument against the original proposal based on security
grounds is moot. Put another way: There already are a couple of
"backdoors" a superuser can use. By default. Or with plpgsql removed.

Thanks to Dimitri and Andres for patiently explaining and providing
examples.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-22 Thread Markus Wanner
Dimitri,

On 07/22/2013 08:44 PM, Dimitri Fontaine wrote:
> That's the trade-off we currently need to make to be able to ship with
> the current security protections we're talking about.

Anything wrong with shipping postgis-1.5.so and postgis-2.0.so, as we do
for Debian?

> Ok, here's the full worked out example:
> 
>   ~# select lo_import('/path/to/local/on/the/client/adminpack.so');
>lo_import 
>   ---
>82153
>   (1 row)
>   
>   ~# select lo_export(82153, '/tmp/foo.so');
>lo_export 
>   ---
>1
>   (1 row)
>   
>   ~# LOAD '/tmp/foo.so';
>   LOAD
>   
>   ~# CREATE FUNCTION pg_catalog.pg_file_write(text, text, bool)
>  RETURNS bigint AS '/tmp/foo.so', 'pg_file_write'
>  LANGUAGE C VOLATILE STRICT;
>   CREATE FUNCTION

Good point.

Note that these lo_import() and lo_export() methods - unlike the client
counterparts - read and write to the server's file system.

So, your example is incomplete in that SELECT lo_import() doesn't load
data from the client into the database. But there are many other ways to
do that.

Nonetheless, your point holds: lo_export() is a way for a superuser to
write arbitrary files on the server's file system via libpq.

>> Or to put it another way: Trusted PLs exist for a good reason. And some
>> people just value security a lot and want that separation of roles.
> 
> Yeah, security is important, but it needs to address real threats to
> have any value whatsoever. We're not talking about adding new big holes,
> as extensions containing modules are using LANGUAGE C in their script
> and thus are already restricted to superuser. That part would not
> change.

I agree, it's not a big hole. But with security, every additional layer
counts. I think the next step is to clarify whether or not we want to
provide that guarantee.

>> As someone mentioned previously, $PGDATA may well be mounted noexec, so
>> that seems to be a bad choice.
> 
> I don't understand. You want good default security policy in place, and
> you're saying that using PGDATA allows for a really easy policy to be
> implemented by people who don't want the feature. It seems to me to be
> exactly what you want, why would that be a bad choice then?

I'm just saying that mounting $PGDATA with noexec makes $PGDATA unusable
to load libraries from; i.e. pg_dlopen() just outright fails.

And noexec is another protective layer. Building a solution that
requires $PGDATA to be mounted with exec permissions means that very
solution won't work with that protective layer. Which is a bad thing.

>> Alternatively, we could solve that problem the other way around: Rather
>> than template-ify the DSO, we could instead turn the objects created by
>> the SQL scripts into something that's more linked to the script.
>> Something that would reload as soon as the file on disk changes.
> 
> I think that's the wrong way to do it, because you generally want more
> control over those two steps (preparing the template then instantiating
> it) and you're proposing to completely lose that control and have them
> be a single step instead.

Doesn't versioning take care of that? I.e. if you have version 1.0
created in your databases and then install 1.1 system-wide, that
shouldn't immediately "instantiate".

> Use case: shipping a new plproxy bunch of functions to 256 nodes. You
> want to push the templates everywhere then just do the extension update
> as fast as possible so that it appears to be about 'in sync'.
> 
> Pro-tip: you can use a plproxy function that runs the alter extension
> update step using the RUN ON ALL facility.

Contrary use case: you actually *want* to *replace* the installed
version, as it's a plain, backwards-compatible security fix. A plain
'apt-get upgrade' would do the job.

Anyways, I think this is bike-shedding: The question of what mental
model fits best doesn't matter too much, here.

I'm concerned about being consistent with whatever mental model we
choose and about an implementation that would work with whatever
security standards we want to adhere to.

>> (Note how this would make out-of-line extensions a lot closer to the
>> in-line variant your recent patch adds? With the dependency between
>> "template" and "instantiation"?)
> 
> Those dependencies are almost gone now except for being able to deal
> with DROP TEMPLATE FOR EXTENSION and guaranteeing we can actually
> pg_restore our pg_dump script.

I appreciate the bug fixes. However, there still is at least one
dependency. And that might be a good thing. Depending on what mental
model you follow.

> I've been changing the implementation to be what you proposed because I
> think it's making more sense (thanks!), and regression tests are
> reflecting that. My github branch is up-to-date with the last changes
> and I'm soon to send an updated patch that will be really as ready for
> commit as possible without a committer review.

Good, thanks.

>> An attacker having access to a libpq connection with superuser 

Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-22 Thread Markus Wanner
On 07/22/2013 12:11 AM, Hannu Krosing wrote:
>> Dropping this barrier by installing an untrusted PL (or equally insecure
>> extensions), an attacker with superuser rights can trivially gain
>> root.
> Could you elaborate ?
> 
> This is equivalent to claiming that any linux user can trivially gain root.

Uh. Sorry, you're of course right, the attacker can only gain postgres
rights in that case. Thanks for correcting.

The point still holds. It's another layer that an attacker would have to
overcome.

>>> You already mentioned untrusted PL languages, and I don't see any
>>> difference in between offering PL/pythonu and PL/C on security grounds,
>>> really.
>> I agree. However, this also means that any kind of solution it offers is
>> not a good one for the security conscious sysadmin.
> This is usually the case with a "security conscious sysadmin" - they very
> seldom want to install anything.

Exactly. That's why I'm favoring solutions that don't require any
extension and keep the guarantee of preventing arbitrary native code.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-21 Thread Markus Wanner
Salut Dimitri,

On 07/20/2013 01:23 AM, Dimitri Fontaine wrote:
> Markus Wanner  writes:
>>>   - per-installation (not even per-cluster) DSO availability
>>>
>>> If you install PostGIS 1.5 on a system, then it's just impossible to
>>> bring another cluster (of the same PostgreSQL major version), let
>> On Debian, that should be entirely possible. Certainly installing
>> Postgres 9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0
>> is. I designed it to be.
>>
>> I think I'm misunderstanding the problem statement, here.
> 
> (of the same PostgreSQL major version)

Not sure what the issue is, here, but I agree that should be possible.

>> Can CREATE EXTENSION check if the standbys have the extension installed?
>> And refuse creation, if they don't?
> 
> No, because we don't register standbys so we don't know who they are,
> and also because some things that we see connected and using the
> replication protocol could well be pg_basebackup or pg_receivexlog.

Can the standby check? In any case, these seem to be problems we can
solve without affecting security.

> Also, it's possible that the standby is only there for High Availability
> purposes and runs no user query.

Requiring the sysadmin to install the extensions there, too, seems
justified to me. Sounds like good advice, anyways.

>> I'm sure you are aware that even without this clear separation of roles,
>> the guarantee means we provide an additional level of security against
>> attackers.
> 
> Given lo_import() to upload a file from the client to the server then
> LOAD with the absolute path where the file ended up imported (or any
> untrusted PL really), this argument carries no sensible weight in my
> opinion.

lo_import() won't write a file for LOAD to load. An untrusted PL (or any
other extension allowing the superuser to do that) is currently required
to do that.

Or to put it another way: Trusted PLs exist for a good reason. And some
people just value security a lot and want that separation of roles.

>> Nonetheless, the "safe by default" has served us well, I think.
> 
> That's true. We need to consider carefully the proposal at hand though.
> 
> It's all about allowing the backend to automatically load a file that it
> finds within its own $PGDATA so that we can have per-cluster and
> per-database modules (DSO files).

As someone mentioned previously, $PGDATA may well be mounted noexec, so
that seems to be a bad choice.

> The only difference with before is the location where the file is read
> from, and the main security danger comes from the fact that we used to
> only consider root-writable places and now propose to consider postgres
> bootstrap user writable places.

FWIW, I only proposed to let postgres check write permissions on
libraries it loads. IIUC we don't currently do that, yet. And Postgres
happily loads a world-writable library, ATM.

> Having the modules in different places in the system when it's a
> template and when it's instantiated allows us to solve a problem I
> forgot to list:
> 
>   - upgrading an extension at the OS level
> 
> Once you've done that, any new backend will load the newer module
> (DSO file), so you have to be real quick if installing a hot fix in
> production and the SQL definition must be changed to match the new
> module version…

I agree, that's a problem.

Alternatively, we could solve that problem the other way around: Rather
than template-ify the DSO, we could instead turn the objects created by
the SQL scripts into something that's more linked to the script.
Something that would reload as soon as the file on disk changes.

(Note how this would make out-of-line extensions a lot closer to the
in-line variant your recent patch adds? With the dependency between
"template" and "instantiation"?)

> With the ability to "instantiate" the module in a per-cluster,
> per-database directory within $PGDATA, the new version of the DSO module
> would only be put in place and loaded at ALTER EXTENSION UPDATE time.
> 
> I'm still ok with allowing to fix those problems only when a security
> option that defaults to 'false' has been switched to 'true', by the way,
> so that it's an opt-in,

Okay, good.

For the issues you raised, I'd clearly prefer fixes that maintain
current security standards, though.

> but I will admit having a hard time swallowing
> the threat model we're talking about…

An attacker having access to a libpq connection with superuser rights
cannot currently run arbitrary native code. He can try a DoS by
exhausting system resources, but that's pretty far from being
invisible.

Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-17 Thread Markus Wanner
On 07/17/2013 08:52 PM, Dimitri Fontaine wrote:
> the next step of this discussion should be about talking about the
> problems we have and figuring out *if* we want to solve them, now that
> we managed to explicitly say what we want as a project.
> 
>   - per-installation (not even per-cluster) DSO availability
> 
> If you install PostGIS 1.5 on a system, then it's just impossible to
> bring another cluster (of the same PostgreSQL major version), let
> alone database, with PostGIS 2.x, even for migration assessment
> purposes. The "By Design™" part is really hard to explain even to
> security conscious users.

On Debian, that should be entirely possible. Certainly installing
Postgres 9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0
is. I designed it to be.

On distributions that do not allow parallel installation of multiple
Postgres major versions, it's certainly not the extension's fault.

I think I'm misunderstanding the problem statement, here.

>   - hot standby and modules (aka DSO)
> 
> As soon as you use some functions in 'language C' you need to
> carefully watch your external dependencies ahead of time. If you do
> CREATE EXTENSION hstore;, create an hstore column and a GiST index
> on it, then query the table on the standby… no luck. You would tell
> me that it's easy enough to do and that it's part of the job, so see
> next point.

Agreed, that's an area where Postgres could do better. I'd argue this
should be possible without relaxing the security guarantees provided,
though. Because there likely are people wanting both.

Can CREATE EXTENSION check if the standbys have the extension installed?
And refuse creation, if they don't?

>   - sysadmin vs dba, or PostgreSQL meets the Cloud
> 
> The current model of operations is intended for places where you
> have separate roles: the sysadmin cares about the OS setup and will
> provide with system packages (compiled extensions and the like), and
> DBA are never root on the OS. They can CREATE EXTENSION and maybe
> use the 'postgres' system account, but that's about it.

I'm sure you are aware that even without this clear separation of roles,
the guarantee means we provide an additional level of security against
attackers.

> Given the current rise of the Cloud environments and the devops
> teams, my understanding is that this model is no longer the only
> one. Depending on who you talk to, in some circles it's not even a
> relevant model anymore: user actions should not require the
> intervention of a "sysadmin" before hand.
> 
> While I appreciate that many companies still want the old behavior
> that used to be the only relevant model of operations, I think we
> should also provide for the new one as it's pervasive enough now for
> us to stop ignoring it with our "I know better" smiling face.

I'd even think it's a minority who actually uses the guarantee we're
talking about - mostly because of the many and widespread untrusted PLs
(which undermine the guarantee). And thus even before the rise of the
cloud.

Nonetheless, the "safe by default" has served us well, I think.

> Now it should be possible to solve at least some of those items while
> still keeping the restriction above, or with an opt-in mechanism to
> enable the "works by default, but you have to solve the security
> implications yourself" behaviour. A new GUC should do it, boolean,
> defaults false:
> 
>   runs_in_the_cloud_with_no_security_concerns = false

[ I usually associate "cloud" with (increased) "security concerns", but
  that's an entirely different story. ]

> I don't think the relaxed behaviour we're talking about is currently
> possible to develop as an extension, by the way.

It's extensions that undermine the guarantee, at the moment. But yeah,
it depends a lot on what kind of "relaxed behavior" you have in mind.

>> Andres made two contrib-free suggestions: with COPY TO BINARY, you get a
>> header prepended.
> 
> Well, what about using lo_import()?

That only reads from the file system. You probably meant lo_export(),
which writes - although not to the server's file system, only to the
(libpq) client's. No threat to the server.

> Yes it's dangerous. It's also solving real world problems that I see no
> other way to solve apart from bypassing the need to ever load a DSO
> file, that is embedding a retargetable C compiler in the backend.

If the sysadmin wants to deny the postgres process arbitrary execution
of native code, any kind of embedded compiler is likely equally
unwelcome.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-16 Thread Markus Wanner
On 07/16/2013 01:27 AM, Robert Haas wrote:
> Andres points out that you can install adminpack to obtain
> local filesystem access, and that is true.  But the system
> administrator can also refuse to allow adminpack, and/or use selinux
> or other mechanisms to prevent the postgres binary from writing a file
> with execute permissions.

I think execute permissions (on the FS) are irrelevant. It's about
loading a shared library. The noexec mount option can prevent that, though.

But okay, you're saying we *have* and *want* a guarantee that even a
superuser cannot execute arbitrary native code via libpq (at least in
default installs w/o extensions).

Andres made two contrib-free suggestions: with COPY TO BINARY, you get a
header prepended, which I think is sufficient to prevent a dlopen() or
LoadLibrary(). Text and CSV formats of COPY escape their output, so it's
hard to write \000 or other control bytes. ESCAPE and DELIMITER also
have pretty restrictive requirements. So COPY doesn't seem quite "good"
enough to write a valid DSO.
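
For reference: per the documented binary copy format, every file
produced by COPY TO BINARY starts with a fixed 11-byte signature, so it
can never begin with a valid ELF or PE header:

  /* "PGCOPY\n\377\r\n\0" - the literal's implicit NUL is byte 11 */
  static const char pgcopy_signature[11] = "PGCOPY\n\377\r\n";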

His second suggestion was tuplesort tapes. tuplesort.c says: "We require
the first "unsigned int" of a stored tuple to be the total size on-tape
of the tuple...". That's kind of a header as well. Writing a proper DSO
certainly does not sound trivial, either.

From a security perspective, I wouldn't want to rely on that guarantee.
Postgres writes too many files to be sure none of those can be abused to
write a loadable DSO, IMO.

Mounting $PGDATA 'noexec' and allowing the postgres user to write only
to such noexec mounts sounds like a good layer. It's independent, though
- it can be used whether or not the above guarantee holds.

> Things aren't quite so bad if we write the bits to a file first and
> then dynamically load the file.  That way at least noexec or similar
> can provide protection.  But it still seems like a pretty dangerous
> direction.

I agree now. Thanks for elaborating.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
Robert,

On 07/15/2013 12:12 PM, Markus Wanner wrote:
> On 07/15/2013 05:49 AM, Robert Haas wrote (in a slightly different order):
>> There is a lot of
>> (well-founded, IMHO) resistance to the idea of letting users install C
>> code via an SQL connection.
> 
> Nowhere did I propose anything close to that. I'm in full support of the
> resistance. Pitchforks and torches ready to rumble. :-)

To clarify: In the above agreement, I had Postgres-level ordinary
users in mind, who certainly shouldn't ever be able to upload and run
arbitrary native code.

I'm not quite as enthusiastic if you meant the system-level user that
happens to have superuser rights on Postgres. As Andres pointed out,
there are some more holes to plug if you want to lock down a superuser
account. [1]

I'm not quite clear what kind of "user" you meant in your statement above.

Regards

Markus Wanner


[1]: I see the theoretical benefits of providing yet another layer of
security. Closing the existing holes would require restricting LOAD;
that alone seems to suffice (provided the sysadmin makes sure not to
accidentally provide any of the harmful extensions).

How about the (optional?) ability to restrict LOAD to directories and
files that are root-writable only? Is that enough for root to prevent
the untrusted Postgres superuser from running arbitrary native code? Or
does that still have other holes? How many real use cases does it
break? How many does it fix?
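
A sketch of the check I have in mind (hypothetical function; no claim
this covers symlink games or the permissions of parent directories):

  #include <stdbool.h>
  #include <sys/stat.h>

  /* only honor LOAD for files owned by root and writable by nobody else */
  static bool
  load_path_is_root_only(const char *path)
  {
      struct stat st;

      if (stat(path, &st) != 0)
          return false;
      return st.st_uid == 0 &&
             (st.st_mode & (S_IWGRP | S_IWOTH)) == 0;
  }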




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
On 07/15/2013 12:19 PM, Andres Freund wrote:
> On 2013-07-15 12:12:36 +0200, Markus Wanner wrote:
>> Granted, the "internalization" of the DSO is somewhat critical to
>> implement. Running as a non-root user, Postgres has less power than root
>> and certainly cannot protect the internalized DSO from modification by
>> root. It can, however, store the DSO well protected from other users
>> (e.g. in a directory within $PGDATA).
> 
> There's heaps of problems with that though. Think noexec mounts or selinux.

Good point.

Note, however, that "internalize" doesn't necessarily mean exactly that
approach. It could be yet another directory, even system-wide, which any
distribution should be able to prepare.

I intentionally left the "internalizing" somewhat undefined. It could
even be more equivalent to what is done with the SQL stuff, i.e. some
system catalog. But that just poses other implementation issues.
(Loading a DSO from memory doesn't sound very portable to me.)

>> Relying on the external DSO exactly once can provide additional
>> safety.
> 
> Not necessarily. Think of a distribution provided upgrade with a
> security fix in an extension.

Ugh... only to figure out the patched DSO is incompatible with the old
version of the SQL scripts? And therefore having to re-create the
extension anyway? That only highlights why this difference in
treatment of SQL and DSO is troublesome.

> On a machine with dozens of clusters. Now
> you need to make sure each and every one of them has the new .so.

Agreed.

So this sounds like the other approach to unification may be more
useful: linking the SQL scripts better and making them (re-)load from
the extensions directory, just like the DSOs.

> I think protecting against the case where such directories are writable
> is a rather bad argument.

I agree. That's why I'm neglecting the security implications and stated
there are none.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
On 07/15/2013 12:05 PM, Andres Freund wrote:
> A superuser can execute native code as the postgres user. That's it.

Oh, I thought Robert meant postgres users, i.e. non-superusers.

I hereby withdraw my pitchforks: I'm certainly not opposed to
simplifying the life of superusers, who already have the power.

> Now, I don't think Markus proposal is a good idea on *other* grounds
> though... but that's for another email.

Separation of concerns, huh? Good thing, thanks :-)

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
Robert,

thanks for answering. I think you responded to the (or some) idea in
general, as I didn't see any relation to the quoted part. (Otherwise
this is a hint that I've not been clear enough.)

On 07/15/2013 05:49 AM, Robert Haas wrote (in a slightly different order):
> There is a lot of
> (well-founded, IMHO) resistance to the idea of letting users install C
> code via an SQL connection.

Nowhere did I propose anything close to that. I'm in full support of the
resistance. Pitchforks and torches ready to rumble. :-)

> The security problems with this model have been discussed in every
> email thread about the extension template feature.

I read through a lot of these discussions recently, but mostly found
misunderstandings and a failure to distinguish between the available
extension (template) and the created extension (instantiation). I think
this is a new proposal, which I haven't ever seen discussed. It does
not have much, if anything, to do with Dimitri's patch, except that it
made the point obvious to me, as it continues to mix the two mental
models.

I don't see much of a difference, security-wise, between loading the
DSO at extension creation time vs. loading it at every backend start.
More details below.

My major gripe with the current way extensions are handled is that SQL
scripts are treated as a template, whereas the DSO is only "pointed
to". Changing the SQL scripts in your extension directory has no effect
on an installed extension. Modifying the DSO file certainly has. That's
the inconsistency I'd like to get rid of.


A security analysis of the proposal follows.

Granted, the "internalization" of the DSO is somewhat critical to
implement. Running as a non-root user, Postgres has less power than that
and can certainly not protect the internalized DSO from modification by
root. It can, however, store the DSO well protected from other users
(e.g. in a directory within $PGDATA).

Relying on the external DSO exactly once can provide additional
safety. Consider an extensions directory that's accidentally
world-writable. As it stands, an attacker can modify the DSO and gain
control as soon as a new connection loads it. With my proposal, the
attacker would have to wait until CREATE EXTENSION - a point in time
where the newly created extension is more likely to be tested and
cross-checked.

I'm arguing both of these issues are very minor and negligible,
security-wise. Barring other issues, I conclude there's no security
impact from this proposal.

Arguably, internalizing the DSO (together with the SQL) protects much
better against accidental modifications (e.g. removing the DSO by
uninstalling the extension template via the distribution's packaging
system while some database still has the extension instantiated). That
clearly is a usability and not a security issue, though.

If nothing else (and beneficial in either case): Postgres should
probably check the permissions on the extension directory and at least
emit a warning if it's world-writable. Or outright refuse to create (or
even load) an extension.

Regards

Markus Wanner




Re: [HACKERS] Review: extension template

2013-07-10 Thread Markus Wanner
Peter,

On 07/09/2013 11:04 PM, Peter Eisentraut wrote:
> I think there is an intrinsic conflict here.  You have things inside the
> database and outside.  When they depend on each other, it gets tricky.
> Extensions were invented to copy with that.  They do the job, more or
> less.

I agree. And to extend upon that, I think it's important to distinguish
between the created extension and the available one, i.e. the template.
Only the template lives outside. The created extension itself sits
firmly in the database, possibly with multiple dependencies from other
objects. It does not depend on anything outside of the database
(assuming the extension has no DSO, which does not follow the template
concept).

And yes, we decided the objects that are part of the extension should
not get dumped with pg_dump. Nobody is arguing to change that. Note,
however, that this very decision is what raises the "intrinsic conflict"
for pg_restore, because CREATE EXTENSION in the dump depends on the
outside extension. If anything, Dimitri's patch solves that.

> Now you want to take the same mechanism and apply it entirely
> inside the database.  But that wasn't the point of extensions!  That's
> how you get definitional issues like, should extensions be dumped or not.

IMO the point of extensions is to extend Postgres (with code that's not
part of core). Whether their templates (SQL sources, if you want) are
stored on the file system (outside) or within the database is irrelevant
to the concept.

Think of it that way: Take one of those FUSE-Postgres-FS things [1],
which uses Postgres as the backend for a file system and allows you to
store arbitrary files. Mount that to the extensions directory of your
Postgres instance and make your extension templates available there
(i.e. copy them there). CREATE EXTENSION would just work, reading the
templates for the extension to create from itself, via that fuse
wrapper. (If the FUSE wrapper itself was using an extension, you'd get
into an interesting chicken and egg problem, but even that would be
resolvable, because the installed extension doesn't depend on the
template it was created from.)

Think of Dimitri's patch as a simpler and more elegant way to achieve
the very same thing. (Well, modulo our disagreement about the dependency
between extension and templates.) And as opposed to the file system or
FUSE approach, you'd even gain transactional safety, consistency (e.g. a
constraint can enforce that a full version exists as the basis for an
upgrade script), etc... But who am I to tell you the benefits of storing
data in a database?

Of course, you then also want to be able to back up your templates (not
the extensions) stored in the database. Just like you keep a backup of
your file-system templates. Either by simply making a copy, or maybe by
keeping an RPM or DEB package of it available. Thus, of course,
templates for extensions need to be dumped as well.

> I don't believe the above use case.  (Even if I did, it's marginal.)
> You should always be able to arrange things so that an upgrade of an
> inside-the-database-package is possible before or after pg_restore.

Dimitri's scenario assumes an old and a new version of an extension, as
well as an old and a new Postgres major version, where the old extension
is not compatible with the new Postgres major version. That certainly
seems like a plausible scenario to me (postgis-2.0 is not compatible
with Postgres-9.3, for example - granted, it carries a DSO, so it's not
really a good example).

Given how extensions work, to upgrade to the new Postgres major version
*and* to the required new version of the extension, you don't ever need
to "upgrade an inside-the-database-package". Instead, you need to:

createdb -> provide templates -> CREATE EXTENSION -> restore data

Now, CREATE EXTENSION and restoring your data have effectively been
merged for the user, as pg_dump emits proper CREATE EXTENSION commands.
"Providing templates" so far meant installing an RPM or DEB, or copying
the files manually.

But in fact, how and where you provide templates for the extension is
irrelevant to that order. And the possibility to merge the second step
into the 'restore data' step certainly sounds appealing to me.

> pg_dump and pg_restore are interfaces between the database and the
> outside. They should have nothing to do with upgrading things that live
> entirely inside the database.

I don't get your point here. In my view, libpq is intended to modify the
things that live inside the database, including extensions (e.g. ALTER
EXTENSION ADD FUNCTION). Or what kind of "things that live entirely
inside the database" do you have in mind?

> There would be value to inside-the-database package management, but it
> should be a separate concept.

Anything that's

Re: [HACKERS] Review: extension template

2013-07-09 Thread Markus Wanner
Salut Dimitri,

On 07/09/2013 12:40 PM, Dimitri Fontaine wrote:
> Markus Wanner  writes:
>> Or how do you think would pg_restore fail, if you followed the mental
>> model of the template?
> 
>   # create template for extension foo version 'x' as '';
>   # create extension foo;
>   # alter template for extension foo rename to bar;
> 
>   $ pg_dump | psql
> 
> And now it's impossible to CREATE EXTENSION foo, because there's no
> source to install it from available. I think we should actively prevent
> that scenario to happen in the field (current patch prevents it).

I see. You value that property a lot.

However, I don't think the plain ability to create an extension is quite
enough to ensure a consistent restore. You'd also have to ensure you
recreate the very *same* contents of the extension that you dumped.

Your patch doesn't do that. It seems to stop enforcing consistency at
some arbitrary point in between. For example, you can ALTER a function
that's part of the extension. Or ALTER TEMPLATE FOR EXTENSION while an
extension of that version is installed.

> Usually what you do when you want to change an extension is that you
> provide an upgrade script then run it with ALTER EXTENSION UPDATE.

As a developer, I often happen to work on one and the same version, but
test multiple modifications. This ability to change an extension
"on-the-fly" seems pretty valuable to me.

> The current file system based extensions allow you to maintain
> separately the files foo--1.0.sql and foo--1.2.sql, and you don't need
> to copy a current version of the whole extension away before hacking the
> new version.
>
> The Extension Template facility in the current patch allows you to do
> much the same

Sure. I'm aware of that ability and appreciate it.

> I see that you've spent extra time and effort to better understand any
> community consensus that might exist around this patch series, and I
> want to say thank you for that!

Thank you for your patience in pursuing this and improving user
experience with extensions. I really appreciate this work.

> Basically what I'm saying in this too long email is that I need other
> contributors to voice-in.

I fully agree. I don't want to hold this patch back any further, if it's
just me.

>> I really recommend to rename the feature (and the
>> commands), in that case, though. We may rather call the existing
>> file-system thingie an "extension template", instead, as it becomes a
>> good differentiator to what you're proposing.
> 
> Any proposals?

"Inline extension" is a bit contradictory. Maybe "managed extension"
describes best what you're aiming for.

In a similar vein, "out-of-line extension" sounds redundant to me, so
I'd rather characterize the file-system thingie as "extension templates"
wherever a clear distinction between the two is needed.

>> How about ALTER EXTENSION ADD (or DROP)? With the link on the
>> "template", you'd have to prohibit that ALTER as well, based on the
>> exact same grounds as the RENAME: The installed extension would
>> otherwise differ from the "template" it is linked to.
> 
> You're "supposed" to be using that from within the template scripts
> themselves. The main use case is the upgrade scripts from "unpackaged".

Agreed. The documentation says it's "mainly useful in extension update
scripts".

> I could see foreclosing that danger by enforcing that creating_extension
> is true in those commands.

For managed extensions only, right? It's not been prohibited for
extensions so far.

>> See how this creates an animal pretty different from the current
>> extensions? And IMO something that's needlessly restricted in many ways.
> 
> Well really I'm not convinced. If you use ALTER EXTENSION ADD against an
> extension that you did install from the file system, then you don't know
> what will happen after a dump and restore cycle, because you didn't
> alter the files to match what you did, presumably.

Yeah. The user learned to work according to the template model. Maybe
that was not the best model to start with. And I certainly understand
your desire to ensure a consistent dump & restore cycle. However...

> If you do the same thing against an extension that you did install from
> a catalog template, you just managed to open yourself to the same
> hazards… 

... I think the current patch basically just moves the potential
hazards. Maybe these moved hazards are less dramatic and justify the
move. Let's recap:

In either case (template model & link model) the patch:

 a) guarantees to restore the templa

Re: [HACKERS] Review: extension template

2013-07-09 Thread Markus Wanner
Dimitri,

leaving the template vs. link model aside for a moment, here are some
other issues I ran into - all under the assumption that we want the
link model.

On 07/08/2013 11:49 AM, Dimitri Fontaine wrote:
> Please find attached to this mail version 9 of the Extension Templates
> patch with fixes for the review up to now.

First of all, I noticed that creating a template for a newer version is
prohibited if an extension already exists:

> db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $foo$ SELECT 1; $foo$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# CREATE EXTENSION foo;
> CREATE EXTENSION
> db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.1' AS $foo$ SELECT 2; $foo$;
> ERROR:  extension "foo" already exists

Why is that?

I then came to think of the upgrade scripts... what do we link against
if an extension has been created from some full version and then one or
more upgrade scripts got applied?

Currently, creation of additional upgrade scripts is not blocked -
which is a good thing, IMO; I don't like the block on newer full
versions. However, the upgrade doesn't seem to change the dependency,
so you can still delete the update script after the update. Consider
this:

> db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# CREATE EXTENSION foo;
> CREATE EXTENSION
> db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# ALTER EXTENSION foo UPDATE TO '0.1';
> ALTER EXTENSION
> db1=# SELECT * FROM pg_extension;
>  extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition 
> ---------+----------+--------------+----------------+------------+-----------+--------------
>  plpgsql |       10 |           11 | f              | 1.0        |           | 
>  foo     |       10 |         2200 | f              | 0.1        |           | 
> (2 rows)
> 
> db1=# DROP TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1';
> DROP TEMPLATE FOR EXTENSION
> db1=# SELECT * FROM pg_extension;
>  extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition 
> ---------+----------+--------------+----------------+------------+-----------+--------------
>  plpgsql |       10 |           11 | f              | 1.0        |           | 
>  foo     |       10 |         2200 | f              | 0.1        |           | 
> (2 rows)
> 

In this state, extension foo as of version '0.1' is installed, but
running this through dump & restore, you'll only get back '0.0'.

Interestingly, the following "works" (in the sense that the DROP of the
upgrade script is prohibited):

> db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# CREATE EXTENSION foo VERSION '0.1';
> CREATE EXTENSION
> db1=# DROP TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1';
> ERROR:  cannot drop update template for extension foo because other objects depend on it
> DETAIL:  extension foo depends on control template for extension foo
> HINT:  Use DROP ... CASCADE to drop the dependent objects too.

However, in that case, you are free to drop the full version:

> db1=# DROP TEMPLATE FOR EXTENSION foo VERSION '0.0';
> DROP TEMPLATE FOR EXTENSION

This certainly creates a bad state that leads to an error when run
through dump & restore.


Maybe this one...

> db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
> CREATE TEMPLATE FOR EXTENSION
> db1=# DROP TEMPLATE FOR EXTENSION foo VERSION '0.0';
> DROP TEMPLATE FOR EXTENSION

... should already err out here ...

> db1=# CREATE EXTENSION foo;
> ERROR:  Extension "foo" is not available from "/tmp/pginst/usr/local/pgsql/share/extension" nor as a template
> db1=# CREATE EXTENSION foo VERSION '0.1';
> ERROR:  Extension "foo" is not available from "/tmp/pginst/usr/local/pgsql/share/extension" nor as a template

... and not only here.

I.e. the TO version should probably have a dependency on the FROM
version (that might even be useful in the template model).


Another thing that surprised me is the inability to have an upgrade
script *and* a full version (for the same extension target version).
Even if that's intended behavior, the e

Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
Ian,

On 07/05/2013 07:28 PM, ivan babrou wrote:
> - /*
> -  * Rounding could cause connection to fail; need at 
> least 2 secs
> -  */

You removed the above comment... please check why it's there. The
relevant revision seems to be:

###
commit 2908a838ac2cf8cdccaa115249f8399eef8a731e
Author: Tom Lane 
Date:   Thu Oct 24 23:35:55 2002 +

Code review for connection timeout patch.  Avoid unportable assumption
that tv_sec is signed; return a useful error message on timeout failure;
honor PGCONNECT_TIMEOUT environment variable in PQsetdbLogin; make code
obey documentation statement that timeout=0 means no timeout.
###

> - if (timeout < 2)
> - timeout = 2;
> - /* calculate the finish time based on start + timeout */
> - finish_time = time(NULL) + timeout;
> + gettimeofday(&finish_time, NULL);
> + finish_time.tv_usec += (int) timeout_usec;

I vaguely recall tv_usec only being required to hold values up to
1000000 by some standard. A signed 32-bit value would qualify, but only
holds up to a good half hour's worth of microseconds. That doesn't quite
seem enough to calculate finish_time the way you are proposing to do it.

> + finish_time.tv_sec  += finish_time.tv_usec / 1000000;
> + finish_time.tv_usec  = finish_time.tv_usec % 1000000;
>   }
>   }
>  
> @@ -1073,15 +1074,15 @@ pqSocketPoll(int sock, int forRead, int forWrite, 
> time_t end_time)
>   input_fd.events |= POLLOUT;
>  
>   /* Compute appropriate timeout interval */
> - if (end_time == ((time_t) -1))
> + if (end_time == NULL)
>   timeout_ms = -1;
>   else
>   {
> - time_t  now = time(NULL);
> + struct timeval now;
> + gettimeofday(&now, NULL);
>  
> - if (end_time > now)
> - timeout_ms = (end_time - now) * 1000;
> - else
> + timeout_ms = (end_time->tv_sec - now.tv_sec) * 1000 + 
> (end_time->tv_usec - now.tv_usec) / 1000;

I think that's incorrect on a platform where tv_sec and/or tv_usec is
unsigned. (And the cited commit above indicates there are such platforms.)


On 07/09/2013 02:25 PM, ivan babrou wrote:
> There's no complexity here :)

Not so fast, cowboy...  :-)

Regards

Markus Wanner




Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
On 07/09/2013 09:15 AM, ivan babrou wrote:
> Database server lost network — boom, 2 seconds delay. What's the point then?

Oh, I see. Good point. It could still improve connection time during
normal operation, though.

Nonetheless, I now agree with you: we recommend a pooler, which may be
capable of millisecond timeouts, but arguably is vastly more complex
than the proposed patch. And it even brings its own set of gotchas (lots
of connections). I guess I don't quite buy the complexity argument, yet.

Sure, gettimeofday() is subject to clock adjustments. But so is time().
And if you're setting timeouts that low, you probably know what you're
doing (or at least care about latency a lot). Or is gettimeofday() still
considerably slower on certain architectures or in certain scenarios?
Where's the complexity?

Regards

Markus Wanner




Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
On 07/08/2013 08:31 PM, ivan babrou wrote:
> Seriously, I don't get why running 150 poolers is easier.

Did you consider running pgbouncer on the database servers?

Regards

Markus Wanner




Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
Hello Dimitri,

On 07/08/2013 11:49 AM, Dimitri Fontaine wrote:
> Please find attached to this mail version 9 of the Extension Templates
> patch with fixes for the review up to now.

Thanks, cool.

> Markus Wanner  writes:
>>> I still think that we shouldn't allow creating a template for an
>>> extension that is already installed, though. Do you have any suggestions
>>> for a better error message?
>>
>> If we go for the template model, I beg to differ. In that mind-set, you
>> should be free to create (or delete) any kind of template without
>> affecting pre-existing extensions.
> 
> Then what happens at pg_restore time? The CREATE EXTENSION in the backup
> script will suddenly install the other extensions that happen to have
> the same name? I think erroring out is the only safe way here.

Extensions are commonly identified by name (installed ones as well as
available ones, i.e. templates).

Thus I think if a user renames a template, he might have good reasons to
do so. He likely *wants* it to be a template for a different extension.
Likewise when (re-)creating a template with the very same name as a
pre-existing, installed extension.

Maybe the user just wanted to make a "backup" of the template prior to
modifying it. If he then gives the new template the same name as the old
one had, it very likely is similar, compatible or otherwise intended to
replace the former one.

The file-system templates work exactly that way (modulo DSOs). If you
create an extension, then modify (or rename and re-create under the same
name) the template on disk, then dump and restore, you end up with the
new version of it. That's how it worked so far. It's simple to
understand and use.

> It's all about pg_restore really, but it's easy to forget about that and
> get started into other views of the world. I'll try to be better at not
> following those tracks and just hammer my answers with "pg_restore" now.

That's unlikely to be of much help. It's not like pg_restore would stop
working. It would just work differently: more like the file-system
templates, more like the user already knows and (likely) expects, and
more like the "template" model that you advertise.

Or how do you think pg_restore would fail if you followed the mental
model of the template?

>> In any case, I'm arguing this "template" renaming behavior (and the
>> subsequent error) are the wrong thing to do, anyways. Because an
>> extension being linked to a parent of a different name is weird and IMO
>> not an acceptable state.
> 
> Yes, you're right on spot here. So I've amended the patch to implement
> the following behavior (and have added a new regression test):
> 
>   # CREATE TEMPLATE FOR EXTENSION foo VERSION 'v' AS '';
>   # CREATE EXTENSION foo;
>   # ALTER TEMPLATE FOR EXTENSION foo RENAME TO bar;
>   ERROR:  55006: template for extension "foo" is in use
>   DETAIL:  extension "foo" already exists
>   LOCATION:  AlterExtensionTemplateRename, template.c:1040
>   STATEMENT:  ALTER TEMPLATE FOR EXTENSION foo RENAME TO bar;

Okay, good, this prevents the strange state.

However, this also means you restrict the user even further... How can
he save a copy of an existing template, before (re-)creating it with
CREATE TEMPLATE FOR EXTENSION?

On the file system, a simple cp or mv is sufficient before
(re)installing the package from your distribution, for example.

>> I bet that's because people have different mental models in mind. But I
>> probably stressed that point enough by now...
> 
> FWIW, I do agree.

Good. Why do you continue to propose the link model?

> But my understanding is that the community consensus
> is not going that way.

What way? And what community consensus?

Re-reading some of the past discussions, I didn't find anybody voting
for a dependency between the template and the created extension. And at
least Tom pretty clearly had the template model in mind, when he wrote
[1]: "We don't want it to look like manipulating a template has anything
to do with altering an extension of the same name (which might or might
not even be installed)." or [2]: "But conflating this functionality
[i.e. extension templates] with installed extensions is just going to
create headaches."

The closest I found was Robert Haas mentioning [3], that "[he doesn't]
see a problem having more than one kind of extensions". However, please
mind the context. He doesn't really sound enthusiastic, either.

I'm puzzled about some of your words in that thread. In the very message
Robert responded to, you wrote [4]: "Having more than one way to ship an
extension is good, having two different animals with two [...]"

Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
On 07/08/2013 10:20 AM, Dimitri Fontaine wrote:
> Bypassing the file system entirely in order to install an extension. As
> soon as I figure out how to, including C-coded extensions.

Do I understand correctly that you want to keep the extensions (or their
templates) out of the dump and require the user to "upload" them via
libpq prior to the restore, instead of installing them via .deb or .rpm?

This would explain why you keep the CREATE TEMPLATE FOR EXTENSION as a
separate step from CREATE EXTENSION. And why you, too, insist on wanting
templates, and not just a way to create an extension via libpq.

However, why don't you follow the template model more closely? Why
should the user be unable to create a template, if there already exists
an extension of the same name? That's an unneeded and disturbing
limitation, IMO.

My wish: Please drop the pg_depend link between template and extension
and make the templates shared across databases. So I also have to
install the template only once per cluster. Keep calling them templates,
then. (However, mind that file-system extension templates are templates
as well. In-line vs. out-of-line templates, if you want.)

I think you could then safely allow an upgrade of an extension that has
been created from an out-of-line template by an upgrade script that
lives in-line. And vice-versa. Just as an example. It all gets nicer and
cleaner, if the in-line thing better matches the out-of-line one, IMO.

An extension should look and behave exactly the same, independent of
what kind of template it has been created from. And as we obviously
cannot add a pg_depend link to a file on the file system, we better
don't do that for the in-line variant, either, to maintain the symmetry.

> The main feature of the extensions system is its ability to have a clean
> pg_restore process even when you use some extensions. That has been the
> only goal of the whole feature development.

Great! Very much appreciated.

> Let me stress that the most important value in that behavior is to be
> able to pg_restore using a newer version of the extension, the one that
> works with the target major version. When upgrading from 9.2 to 9.3 if
> you depend on keywords that now are reserved you need to install the
> newer version of the extension at pg_restore time.
> 
> The main features I'm interested into beside a clean pg_restore are
> UPDATE scripts for extensions and dependency management, even if that
> still needs improvements. Those improvements will be relevant for both
> ways to make extensions available for your system.

+1

> We can not use the name "module" for anything else, IMNSHO.

Agreed.

> The main goal here is not to have the extension live inside the database
> but rather to be able to bypass using the server's filesystem in order
> to be able to CREATE EXTENSION foo; and then to still have pg_restore do
> the right thing on its own.

Note that with the current, out-of-line approach, the *extension*
already lives inside the database. It's just the *template*, that
doesn't. (Modulo DSO, but the patch doesn't handle those either, yet. So
we're still kind of excluding those.)

Allowing for templates to live inside the database as well is a good
thing, IMO.

> If you want to scratch the new catalogs part, then just say that it's
> expected to be really complex to pg_restore a database using extensions,
> back to exactly how it was before 9.1: create the new database, create
> the extensions your dump depends on in that new database, now pg_restore
> your backup manually filtering away the extensions' objects or ignoring
> the errors when pg_restore tries to duplicate functions you already
> installed in the previous step. No fun.

Definitely not. Nobody wants to go back there. (And as Heikki pointed
out, if you absolutely want to, you can even punish yourself that way.)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
On 06/10/2013 09:43 PM, Hannu Krosing wrote:
> On 07/08/2013 09:26 AM, Heikki Linnakangas wrote:
>> The concept was useful, but not something we want
>> to call an extension, because the distinguishing feature of an
>> extension is that it lives outside the database and is *not* included
>> in pg_dump.
> Either MODULE or PACKAGE would be good name candidates.
> 
> Still, getting this functionality in seems more important than exact
> naming, though naming them "right" would be nice.

Remember that we already have quite a lot of extensions. And PGXN. Are
we really so wedded to the idea of extensions "living" outside of the
database that we need to come up with something different and incompatible?

Or do you envision modules or packages to be compatible with extensions?
Just putting another label on it so we can still claim extensions are
strictly external to the database? Sorry, I don't get the idea, there.

From a user's perspective, I want extensions, modules, or packages to be
managed somehow. Including upgrades, migrations (i.e. dump & restore)
and removal. The approach of letting the distributors handle that
packaging clearly has its limitations. What's so terribly wrong with
Postgres itself providing better tools to manage those?

Inventing yet another type of extension, module or package (compatible
or not) doesn't help, but increases confusion even further. Or how do
you explain to an author of an existing extension, whether or not he
should convert his extension to a module (if you want those to be
incompatible)?

If it's the same thing, just with different loading mechanisms, please
keep calling it the same: an extension. (And maintain compatibility
between the different ways to load it.)

I fully agree with the fundamental direction of Dimitri's patch. I think
Postgres needs to better manage its extensions itself. Including dump
and restore cycles. However, I think the implementation isn't optimal,
yet. I pointed out a few usability issues and gave reasons why
"template" is a misnomer (with the proposed implementation). "Extension"
is not.

(I still think "template" would be a good mental model. See my other
thread...
http://archives.postgresql.org/message-id/51d72c1d.7010...@bluegap.ch)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
Hello Dimitri,

On 07/07/2013 09:51 PM, Dimitri Fontaine wrote:
> Markus Wanner  writes:
>> Oh, I just realize that pg_extension_{template,control,uptmpl} are not
>> SHARED catalogs, but you need to install the template per-database and
>> then need to enable it - per-database *again*. Why is that?
> 
> Because the current model is not serving us well enough, with a single
> module version per major version of PostgreSQL. Meaning for all the
> clusters on the server, and all the databases in them.

That's not an excuse for letting the user duplicate the effort of
installing templates for every version of his extension in every database.

> We want to be able to have postgis 1.5 and 2.x installed in two
> different databases in the same cluster, don't we?

The extension, yes. The template versions, no. There's utterly no point
in having different 2.0 versions of the same extension in different
databases.

> Well the current patch we still can't because of the dynamically shared
> object side of things, but that's not a good reason to impose the same
> limitation on to the "template" idea.

Without a dynamically shared object, you can well have different
versions of an extension at work in different databases already.

> After playing around with several ideas around that in the past two
> development cycles, the community consensus clearly is that "extensions"
> are *NEVER* going to be part of your dump scripts.

Sounds strange to me. If you want Postgres to manage extensions, it
needs the ability to backup and restore them. Otherwise, it doesn't
really ... well ... manage them.

> Now when using templates you have no other source to install the
> extensions from at pg_restore time, given what I just said.
> 
> The whole goal of the "template" idea is to offer a way to dump and
> restore the data you need for CREATE EXTENSION to just work at restore
> time, even when you sent the data over the wire.

Which in turn violates the above cited community consensus, then. You
lost me here. What's your point? I thought the goal of your patch was
the ability to upload an extension via libpq.

How does that address my concerns about usability and understandability
of how these things work? Why the strange dependencies between templates
and extensions? Or the ability to rename the template, but not the
extension - while still having the later depend on the former?

These things are what I'm opposing to. And I don't believe it
necessarily needs to be exactly that way for dump and restore to work.
Quite the opposite, in fact. Simpler design usually means simpler backup
and restore procedures.

> Current extension are managed on the file system, the contract is that
> it is the user's job to maintain and ship that, externally to PostgreSQL
> responsibilities. All that PostgreSQL knows about is to issue the CREATE
> EXTENSION command at pg_restore time.
> 
> With templates or in-line extensions, the contract is that the user asks
> PostgreSQL to manage its extensions in the same way it does for the
> other objects on the system.

Understood.

> The design we found to address that is
> called "Extension Templates" and is implemented in the current patch.

I placed my concerns with the proposed implementation. It's certainly
not the only way how Postgres can manage its extensions. And I still
hope we can come up with something that's simpler to use and easier to
understand.

However, I'm not a committer nor have I written code for this. I did my
review and proposed two possible (opposite) directions for clean up and
simplification of the design. I would now like others to chime in.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
On 07/07/2013 02:55 PM, Markus Wanner wrote:
> If you want to just upload extensions to a database via libpq..

Let's rephrase this in a (hopefully) more constructive way: I get the
impression you are trying to satisfy many different needs. Way more than
you need to scratch your own itch. To the point that I had trouble
figuring out what exactly the goal of your patch is.

My advice would be: Be brave! Dare to put down any request for
"templates" (including mine) and go for the simplest possible
implementation that allows you to create extensions via libpq. (Provided
that really is the itch you want to scratch. I'm still not quite sure I
got that right.)

As it stands, I get the impression the patch is trying to sell me a
feature that it doesn't really provide. If you stripped the template
stuff, including the CREATE TEMPLATE FOR EXTENSION command, but just
sold me this patch as a simple way to create an extension via libpq,
i.e. something closer to CREATE EXTENSION AS .. , I would immediately
buy that. Currently, while allowing an upload, it seems far from simple,
but adds quite a bit of unwanted complexity. If all I want is to upload
code for an extension via libpq, I don't want to deal with nifty
distinctions between templates and extensions.

Just my opinion, though. Maybe I'm still missing something.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
On 07/06/2013 10:30 PM, Dimitri Fontaine wrote:
> I still think that we shouldn't allow creating a template for an
> extension that is already installed, though. Do you have any suggestions
> for a better error message?

Oh, I just realize that pg_extension_{template,control,uptmpl} are not
SHARED catalogs, but you need to install the template per-database and
then need to enable it - per-database *again*. Why is that?

If you want to just upload extensions to a database via libpq, that
should be a single step (maybe rather just CREATE EXTENSION ... AS ...).
If you want "templates", those should be applicable to all databases, no?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
[...] selling point for this feature?
> 
> The main issue to fix when you want to have that feature, which I want
> to have, is how to define a sane pg_dump policy around the thing. As we
> couldn't define that in the last rounds of reviews, we decided not to
> allow the case yet.

I bet that's because people have different mental models in mind. But I
probably stressed that point enough by now...

> I think that's a fair remark that we want to get there eventually, and
> that like the superuser only limitation, that's something for a future
> patch and not this one.

I agree to defer a specific implementation. However, we should *not*
commit something that follows a mixture of mental models underneath and
sets in stone a strange API we later need to stay compatible with.

Specifically, I request to either follow the template model more closely
(accompanied by a separate patch to adjust binary, out-of-line
templates) or follow the link model more closely. The current naming
doesn't match any of the two, so renaming seems inevitable.

> Yeah, we might want to find some better naming for the on-file extension
> "templates". We can't just call them extension templates though because
> that is the new beast implemented in that patch and it needs to be part
> of your pg_dump scripts, while the main feature of the file based kind
> of templates is that they are *NOT* part of your dump.

That's one issue where the mixture of concepts shines through, yeah. I
fear simply (re)naming things is not quite enough, at this point.
Especially because the thing on the file-system is closer to being a
template than the system catalog based variant.

[ Maybe we could even go with "template extensions" being the
file-system ones vs. "inline extensions" being the system catalog
ones...  In that case, they would follow different mental models. Which
might even be a reasonable solution. However, we need to be aware of
that difference and document it properly, so that users have a chance to
grasp the nifty difference. I'd rather prefer to eliminate such nifty
differences, though. ]

> It's just something I forgot to cleanup when completing the feature set.
> Cleaned up in my git branch.

Great!

>> src/backend/commands/event_trigger.c, definition of
>> event_trigger_support: several unnecessary whitespaces added. These make
>> it hard(er) than necessary to review the patch.
> 
> Here with the following command I see no problem, please advise:
> 
> git diff --check --color-words postgres/master.. -- 
> src/backend/commands/event_trigger.c

That's why I personally don't trust git. Please look at the
patch you submitted.

> Markus Wanner  writes:
>> Dimitri, do you agree to resolve with "returned with feedback" for this
>> CF? Or how would you like to proceed?
> 
> I'd like to proceed, it's the 3rd development cycle that I'm working on
> this idea

Okay, I'm certainly not opposed to that and welcome further development.
I didn't change the status, so it should still be "waiting on author",
which seems reasonable to me, ATM.

> (used to be called "Inline Extensions", I also had a selective
> pg_dump patch to allow dumping some extension scripts, etc). I really
> wish we would be able to sort it out completely in this very CF and
> that's why I'm not proposing any other patch this time.

I understand it's tedious work, sometimes. Please look at this as
incremental progress. I would not have been able to reason about mental
models without your patch. Thanks for that, it was a necessary and good
step from my point of view. Please keep up the good work!

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-05 Thread Markus Wanner
On 07/05/2013 09:05 PM, Markus Wanner wrote:
> The patch has already
> been marked as 'returned with feedback', and I can support that
> resolution (for this CF).

Oops.. I just realize it's only set to "waiting on author", now. I guess
I confused the two states. Please excuse my glitch.

Dimitri, do you agree to resolve with "returned with feedback" for this
CF? Or how would you like to proceed?

Josh, thanks for linking the review in the CF.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal: template-ify (binary) extensions

2013-07-05 Thread Markus Wanner
Hi,

let me elaborate on an idea I had to streamline extension templates. As
I wrote in my recent review [1], I didn't have the mental model of a
template in mind for extensions, so far.

However, writing the review, I came to think of it. I certainly agree
that extensions which do not carry a binary executable can be considered
templates. The SQL scripts can be deleted after creation of the
extension in the database, because the objects created with them are an
instantiation in the database itself. They do not depend on the template.

That's not the case for compiled, executable code that usually lives in
a shared library on the filesystem. There, the "instantiation" of the
"template" merely consists of a link to the library on the file-system.
If you remove that file, you break the extension.

One way to resolve that - and that seems to be the direction Dimitri's
patch is going - is to make the extension depend on its template. (And
thereby breaking the mental model of a template, IMO. In the spirit of
that mental model, one could argue that code for stored procedures
shouldn't be duplicated, but instead just reference its ancestor.)

The other possible resolution is to make all extensions fit the template
model, i.e. make extensions independent of their templates.

This obviously requires Postgres to internalize the compiled binary code
of the library upon instantiation. (Storing the binary blobs by hash in
a shared catalog could easily prevent unwanted duplication between
databases.)
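
As a rough sketch of what I mean - catalog name and layout entirely
made up:

  CREATE TABLE pg_shlibrary (
      libhash bytea PRIMARY KEY,  -- e.g. a SHA-1 over the object file
      libarch text NOT NULL,      -- CPU architecture / ABI it was built for
      libdata bytea NOT NULL      -- the shared object itself
  );

An extension's instantiation would then reference the blob by its hash,
so two databases using the same library share a single copy.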

Advantages:
 - Consistent mental model: template
 - Closes the gap between how SQL scripts and binary parts of
   an extension are handled. (Or between binary and non-binary
   extensions.)
 - Compatibility and co-existence between file-system and
   system catalog based extension templates is simplified.
 - Independence of extensions from library files shipped by
   3rd party extensions (those would only ship a template, not
   the extension per se).
 - Paves the way to uploading extensions that carry a binary,
   executable payload via libpq.

Challenges:
 - CPU Architecture must be taken into account for backups. Restoring
   a backup on a different architecture would still require the
   (external) binary code for that specific architecture, because we
   cannot possibly store binaries for all architectures.
 - A bit of a mind shift for how binary extensions work.


Regards

Markus Wanner



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: extension template

2013-07-05 Thread Markus Wanner
[...] However, that's reasonably far down the wish-list.

[ Also note the nifty difference between "extension already available"
vs "extension already exists". The former seems to mean the "template"
exists, while the latter refers to the "instantiation". ]

However, the other way around cannot be prevented by Postgres: Files
representing an extension (template, if you want) can always be created
alongside a pre-existing system catalog installation. Postgres has no
way to prevent that (other than ignoring the files, maybe, but...). It
seems the file system variant takes precedence over anything
pre-existing in the system catalogs. This should be documented.

The documentation for ALTER TEMPLATE FOR EXTENSION says: "Currently, the
only supported functionality is to change the template's default
version." I don't understand what that's supposed to mean. You can
perfectly well rename and alter the template's code.

I didn't look too deep into the code, but it seems Jaime and Hitoshi
raised some valid points.

Assorted very minor nit-picks:

In catalog/objectaddress.c, get_object_address_unqualified(): the case
OBJECT_TABLESPACE is being moved down. That looks like an unintended
change. Please revert.

src/backend/commands/event_trigger.c, definition of
event_trigger_support: several unnecessary whitespaces added. These make
it hard(er) than necessary to review the patch.

Regards

Markus Wanner


[1]: CF: Extension Templates
https://commitfest.postgresql.org/action/patch_view?id=1032

[2]: Dimitri's github branch 'tmpl4':
https://github.com/dimitri/postgres/tree/tmpl4


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 11:13 PM, Jeff Janes wrote:
> Wouldn't any IO system being used on a high-end system be fairly good
> about making this work through interleaved read-ahead algorithms?

To some extent, certainly. It cannot possibly get better than a fully
sequential load, though.

> That sounds like it would be much more susceptible to lock contention,
> and harder to get bug-free, than dividing into bigger chunks, like whole
> 1 gig segments.  

Maybe, yes. Splitting a known amount of work into equal pieces sounds
like a pretty easy parallelization strategy. In case you don't know the
total amount of work or the size of each piece in advance, it gets a bit
harder. Choosing chunks that turn out to be too big certainly hurts.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 06:35 PM, Nicolas Barbier wrote:
> I am assuming that this (comparatively very small and super-hot) index
> is cached all the time, while for the other indexes (that are
> supposedly super-huge) only the top part stays cached.
> 
> I am mostly just trying to find out where Yuri’s “partitioning is
> needed for super-huge tables” experience might come from, and noting
> that Heikki’s argument might not be 100% valid.

I think the OP made that clear by stating that his index has relatively
low selectivity. That seems to be a case that Postgres doesn't handle
very well.

> I think that the
> “PostgreSQL-answer” to this problem is to somehow cluster the data on
> the “hotness column” (so that all hot rows are close to one another,
> thereby improving the efficiency of the caching of relation blocks) +
> partial indexes for the hot rows (as first mentioned by Claudio; to
> improve the efficiency of the caching of index blocks).

Agreed, sounds like a sane strategy.
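
For example, something along these lines (table and column names made
up):

  CREATE INDEX orders_hot_idx ON orders (customer_id)
      WHERE order_date > '2013-01-01';

Only the hot rows end up in that index, which keeps it small enough to
stay fully cached.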

> My reasoning was: To determine which index block to update (typically
> one in both the partitioned and non-partitioned cases), one needs to
> walk the index first, which supposedly causes one additional (read)
> I/O in the non-partitioned case on average, because there is one extra
> level and the lower part of the index is not cached (because of the
> size of the index). I think that pokes a hole in Heikki’s argument of
> “it really doesn’t matter, partitioning == using one big table with
> big non-partial indexes.”

Heikki's argument holds for the general case, where you cannot assume a
well defined hot partition. In that case, the lowest levels of all the
b-trees of the partitions don't fit in the cache, either. A single index
performs better in that case, because it has lower overhead.

I take your point that in case you *can* define a hot partition and
apply partitioning, the hot(test) index(es) are more likely to be cached
and thus require less disk I/O.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
> (walking one index) is equally fast, an insertion must update all 50
> indexes. When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os. In the partitioned case,
> each index would require the normal smaller amount of I/Os. Choosing
> which partition to use must only be done once: The result “counts” for
> all indexes that are to be updated.

I think you're underestimating the cost of partitioning. After all, the
lookup of what index to update for a given partition is a lookup in
pg_index via pg_index_indrelid_index - a btree index.

Additionally, the depth of an index doesn't directly translate to the
number of I/O writes per insert (or delete). I'd rather expect the avg.
number of I/O writes per insert into a b-tree to be reasonably close to
one - depending mostly on the number of keys per page, not depth. (Upper
levels only get rewritten on page splits, i.e. roughly once per a few
hundred leaf insertions for typical key sizes.)

> Additionally: Imagine that the data can be partitioned along some
> column that makes sense for performance reasons (e.g., some “date”
> where most accesses are concentrated on rows with more recent dates).
> The other indexes will probably not have such a performance
> distribution. Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

That's a valid point, yes. I'd call this index partitioning. And with
partial indices, Postgres already has something that gets pretty close,
I think. Though, I don't consider this to be related to how the tuples
of the relation are laid out on disk.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> We could also allow a large query to search a single table in parallel.
> A seqscan would be easy to divide into N equally-sized parts that can be
> scanned in parallel. It's more difficult for index scans, but even then
> it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Markus Wanner
On 06/25/2013 08:26 PM, Andres Freund wrote:
> It's not about the reviewers being less. It's a comparison of
> effort. The effort for a casual review simply isn't comparable with the
> effort spent on developing a nontrivial patch.

Remember: "Debugging is twice as hard as writing the code in the first
place. ..." (Brian Kernighan)

IMO, the kind of reviews we need are of almost "debug level" difficulty.
(To the point where the reviewer becomes a co-author or even takes over
and submits a completely revamped patch instead.)

I agree that the casual review is several levels below that, so your
point holds. I doubt we need more reviews of that kind, though.

Thus, I'm in the AAB camp. The remaining question being: What's the
criterion for becoming a co-author (and thus getting mentioned in the
release notes)?

If at all, we should honor quality work with a "prize". Maybe a prize
for the best reviewer per CF? Maybe even based on votes from the general
public on who's been the best, so reviews gain attention that way...
"Click here to vote for my review." ... Maybe a crazy idea.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:01 PM, k...@rice.edu wrote:
> I think he is referring to the fact that with parallel query execution,
> multiple partitions can be processed simultaneously instead of serially
> as they are now with the resulting speed increase.

Processing simultaneously is the purpose of parallel query execution,
yes. But I see no reason for that not to work equally well for
unpartitioned tables.

Disk I/O is already pretty well optimized and parallelized, I think.
Trying to parallelize a seq scan on the Postgres side is likely to yield
far inferior performance.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:10 PM, Yuri Levinsky wrote:
> You typically don't want to use a b-tree index when you select
> more than ~1-2% of your data. 

Agreed. Indices on columns with very low selectivity don't perform well.
(Postgres knows that and uses a sequential scan based on selectivity
estimates. Being able to eliminate entire partitions from such a seq
scan would certainly be beneficial, yes.)
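
With inheritance based partitioning, that elimination already works via
constraint exclusion, along these lines (names made up):

  CREATE TABLE events (created date, payload text);
  CREATE TABLE events_2013 (CHECK (created >= '2013-01-01'))
      INHERITS (events);
  -- with the default constraint_exclusion = partition, the planner
  -- skips children whose CHECK contradicts the WHERE clause:
  SELECT count(*) FROM events WHERE created < '2013-01-01';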

In the Postgres world, though, I think CLUSTERing might be the better
approach to solve that problem. (Note: this has nothing to do with
distributed systems in this case.) I'm not sure what the current status
of auto clustering or optimized scans on such a permanently clustered
table is, though.
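
That is, something like (names made up, again):

  CLUSTER events USING events_created_idx;

which rewrites the table in index order - a one-time operation that has
to be repeated as the table churns, which is exactly why auto clustering
keeps coming up.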

The minmax indices proposed for 9.4 might be another feature worth
looking at.

Both of these approaches may eventually provide a more general and more
automatic way to speed up scans on large portions of a relation, IMO.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> At least until we have parallel
> query execution.  At *that* point this all changes.

Can you elaborate on that, please? I currently have a hard time
imagining how partitions can help performance in that case, either. At
least compared to modern RAID and read-ahead capabilities.

After all, RAID can be thought of as hash partitioning with a very weird
hash function. Or maybe rather range partitioning on an internal key.

Put another way: ideally, the system should take care of optimally
distributing data across its physical storage itself. If you need to do
partitioning manually for performance reasons, that's actually a
deficiency of it, not a feature.

I certainly agree that manageability may be a perfectly valid reason to
partition your data. Maybe there even exist other good reasons. I don't
think performance optimization is one. (It's more like giving the system
a hint. And we all dislike hints, don't we? *ducks*)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-20 Thread Markus Wanner
On 06/20/2013 04:41 PM, Robert Haas wrote:
> The constant factor is also very small.  Generally, I would expect
> num_worker_processes <~ # CPUs

That assumption might hold for parallel querying, yes. In case of
Postgres-R, it doesn't. In the worst case, i.e. with a 100% write load,
a cluster of n nodes, each with m backends performing transactions, all
of them replicated to all other (n-1) nodes, you end up with ((n-1) * m)
bgworkers. Which is pretty likely to be way above the # CPUs on any
single node.

I can imagine other extensions or integral features like autonomous
transactions that might possibly want many more bgworkers as well.

> and scanning a 32, 64, or even 128
> element array is not a terribly time-consuming operation.

I'd extend that to say scanning an array with a few thousand elements is
not terribly time-consuming, either. IMO the simplicity is worth it,
ATM. It's all relative to your definition of ... eh ... "terribly".

.oO( ... premature optimization ... all evil ... )

> We might
> need to re-think this when systems with 4096 processors become
> commonplace, but considering how many other things would also need to
> be fixed to work well in that universe, I'm not too concerned about it
> just yet.

Agreed.

> One thing I think we probably want to explore in the future, for both
> worker backends and regular backends, is pre-forking.  We could avoid
> some of the latency associated with starting up a new backend or
> opening a new connection in that way.  However, there are quite a few
> details to be thought through there, so I'm not eager to pursue that
> just yet.  Once we have enough infrastructure to implement meaningful
> parallelism, we can benchmark it and find out where the bottlenecks
> are, and which solutions actually help most.

Do you mean pre-forking and connecting to a specific database? Or really
just the forking?

> I do think we need a mechanism to allow the backend that requested the
> bgworker to know whether or not the bgworker got started, and whether
> it unexpectedly died.  Once we get to the point of calling user code
> within the bgworker process, it can use any number of existing
> mechanisms to make sure that it won't die without notifying the
> backend that started it (short of a PANIC, in which case it won't
> matter anyway).  But we need a way to report failures that happen
> before that point.  I have some ideas about that, but decided to leave
> them for future passes.  The remit of this patch is just to make it
> possible to dynamically register bgworkers.  Allowing a bgworker to be
> "tied" to the session that requested it via some sort of feedback loop
> is a separate project - which I intend to tackle before CF2, assuming
> this gets committed (and so far nobody is objecting to that).

Okay, sounds good. Given my background, I considered that a solved
problem. Thanks for pointing it out.

>> Sounds like the postmaster is writing to shared memory. Not sure why
>> I've been trying so hard to avoid that, though. After all, it can hardly
>> hurt itself *writing* to shared memory.
> 
> I think there's ample room for paranoia about postmaster interaction
> with shared memory, but all it's doing is setting a flag, which is no
> different from what CheckPostmasterSignal() already does.

Sounds good to me.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-20 Thread Markus Wanner
Robert,

On 06/14/2013 11:00 PM, Robert Haas wrote:
> Parallel query, or any subset of that project such as parallel sort,
> will require a way to start background workers on demand.

thanks for continuing this, very much appreciated. Postgres-R and thus
TransLattice have successfully used a similar approach for years now.

I only had a quick glance over the patch, yet. Some comments on the design:

> This requires some communication channel from ordinary workers to the
> postmaster, because it is the postmaster that must ultimately start
> the newly-registered workers.  However, that communication channel has
> to be designed pretty carefully, lest a shared memory corruption take
> out the postmaster and lead to inadvertent failure to restart after a
> crash.  Here's how I implemented that: there's an array in shared
> memory of a size equal to max_worker_processes.  This array is
> separate from the backend-private list of workers maintained by the
> postmaster, but the two are kept in sync.  When a new background
> worker registration is added to the shared data structure, the backend
> adding it uses the existing pmsignal mechanism to kick the postmaster,
> which then scans the array for new registrations.

That sounds like a good simplification. Even if it's an O(n) operation,
the n in question here has relatively low practical limits. It's
unlikely to be much of a concern, I guess.

Back then, I solved it by having a "fork request slot". After starting,
the bgworker then had to clear that slot and register with a coordinator
process (i.e. the original requestor), so that one learned its fork
request was successful. At some point I expanded that to multiple
request slots to better handle multiple concurrent fork requests.
However, it was difficult to get right and requires more IPC than your
approach.

On the pro side: The shared memory area used by the postmaster was very
small in size and read-only to the postmaster. These were my main goals,
which I'm not sure were the best ones, now that I read your concept.

> I have attempted to
> make the code that transfers the shared_memory state into the
> postmaster's private state as paranoid as humanly possible.  The
> precautions taken are documented in the comments.  Conversely, when a
> background worker flagged as BGW_NEVER_RESTART is considered for
> restart (and we decide against it), the corresponding slot in the
> shared memory array is marked as no longer in use, allowing it to be
> reused for a new registration.

Sounds like the postmaster is writing to shared memory. Not sure why
I've been trying so hard to avoid that, though. After all, it can hardly
hurt itself *writing* to shared memory.

> Since the postmaster cannot take locks, synchronization between the
> postmaster and other backends using the shared memory segment has to
> be lockless.  This mechanism is also documented in the comments.  An
> lwlock is used to prevent two backends that are both registering a new
> worker at about the same time from stomping on each other, but the
> postmaster need not care about that lwlock.
> 
> This patch also extends worker_spi as a demonstration of the new
> interface.  With this patch, you can CREATE EXTENSION worker_spi and
> then call worker_spi_launch(int4) to launch a new background worker,
> or combine it with generate_series() to launch a bunch at once.  Then
> you can kill them off with pg_terminate_backend() and start some new
> ones.  That, in my humble opinion, is pretty cool.

It definitely is. Thanks again.
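
In other words - a sketch based on your description, assuming the
patched worker_spi:

  CREATE EXTENSION worker_spi;
  -- launch four workers at once:
  SELECT worker_spi_launch(i) FROM generate_series(1, 4) AS i;
  -- ... and pg_terminate_backend() kills them off again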

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-20 Thread Markus Wanner
On 06/20/2013 12:27 PM, Marko Tiikkaja wrote:
> My understanding is that the attacker would already have that
> information since the server would have sent an
> AuthenticationMD5Password message to get to the error in the first
> place.  And we still reveal the authentication method to the frontend in
> all other cases ("peer authentication failed", for example).

Oh, right, I wasn't aware of that. Never mind, then.

+1 for keeping it mention "password authentication" explicitly.

However, thinking about this a bit more: Other authentication methods
may also provide password (or even account) expiration times. And may
fail to authenticate a user for entirely different reasons.

Given that, I wonder if "password expired" is such a special case worth
mentioning in case of the "password auth" method. If we go down that
path, don't we also have to include "auth server unreachable" as a
possible cause for authentication failure for methods that use an
external server?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-19 Thread Markus Wanner
On 06/20/2013 12:51 AM, Jeff Janes wrote:
> I think we need to keep the first "password".  "Password authentication"
> is a single thing, it is the authentication method attempted.  It is the
> password method (which includes MD5) which failed, as opposed to the
> LDAP method or the Peer method or one of the other methods.

That's against the rule of not revealing any more knowledge than a
potential attacker already has, no? For that reason, I'd rather go with
just "authentication failed".

> Without this level of explicitness, it might be hard to figure out which
> row in pg_hba.conf was the one that PostgreSQL glommed onto to use for
> authentication.

As argued before, that should go into the logs for diagnosis by the
sysadmin, but should not be revealed to an attacker.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-19 Thread Markus Wanner
This probably is nit-picking, but it interests me in terms of how the
language is used and understood.

On 06/19/2013 08:55 PM, Joshua D. Drake wrote:
> I believe it actually can. The error message that is returned for a bad
> password, bad user or expired password is all the same. Which is why I
> put the username in there.

Sure, the authentication can fail for all these reasons. What I stumbled
over was the formulation of a "failed username". If an engine fails, it
might literally fall apart. The username itself - even if it doesn't
pass authentication - is not falling apart in the same sense. But does
the username (or the password) fail if authentication with it (in
combination with password and account expiration time) is not possible?
After all, it might still be a valid and complete username for another
cluster or another service.

You can probably say: "that username failed" when you actually mean it
"failed to authenticate together with the provided password". Or how do
English native speakers perceive this?

> "Authentication failed or password has expired for user \"%s\""
> 
> Authentication failed covers any combination of a username/password
> being wrong and obviously password expired covers the other.

Works for me. Considering the password to be the thing that expires
(rather than the account) is probably more accurate as well.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-18 Thread Markus Wanner
On 06/16/2013 06:02 PM, Joshua D. Drake wrote:
> Instead of pushing extra info to the logs I decided that we could
> without giving away extra details per policy. I wrote the error message
> in a way that tells the most obvious problems, without admitting to any
> of them. Please see attached:

+1 for solving this with a bit of word-smithing.

However, the proposed wording doesn't sound like a full sentence to my
ears, because a password or username cannot fail per-se.

How about:
"password authentication failed or account expired for user \"%s\""

It's a bit longer, but sounds more like a full sentence, no?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2013-06-10 Thread Markus Wanner
Ants,

the more I think about this, the more I start to like it.

On 06/07/2013 02:50 PM, Ants Aasma wrote:
> On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner  wrote:
>> Agreed. Postgres-R uses a CommitOrderId, which is very similar in
>> concept, for example.
> 
> Do you think having this snapshot scheme would be helpful for Postgres-R?

Yeah, it could help to reduce patch size, after a rewrite to use such a CSN.

>> Or why do you need to tell apart aborted from in-progress transactions
>> by CSN?
> 
> I need to detect aborted transactions so they can be discarded during
> the eviction process, otherwise the sparse array will fill up. They
> could also be filtered out by cross-referencing uncommitted slots with
> the procarray. Having the abort case do some additional work to make
> xid assignment cheaper looks like a good tradeoff.

I see.

>>> Sparse buffer needs to be at least big enough to fit CSN slots for the
>>> xids of all active transactions and non-overflowed subtransactions. At
>>> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
>>> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
>>> or 203KB total in the default configuration.
>>
>> A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
>> the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
> 
> 8 byte alignment for CSNs is needed for atomic if not something else.

Oh, right, atomic writes.

> I think the size could be cut in half by using a base value for CSNs
> if we assume that no xid is active for longer than 2B transactions as
> is currently the case. I didn't want to include the complication in
> the first iteration, so I didn't verify if that would have any
> gotchas.

In Postgres-R, I effectively used a 32-bit order id which wraps around.

In this case, I guess adjusting the base value will get tricky. Wrapping
could probably be used as well, instead.

> The number of times each cache line can be invalidated is
> bounded by 8.

Hm.. good point.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2013-06-07 Thread Markus Wanner
[...] multiple processes may need to
do such a conversion, all starting at the same point in time.

> Rolling back a transaction
> --
> 
> Rolling back a transaction is basically the same as committing, but
> the CSN slots need to be stamped with a AbortedCSN.

Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?

> Sizing the CSN maps
> ---
> 
> CSN maps need to be sized to accommodate the number of backends.
> 
> Dense array size should be picked so that most xids commit before
> being evicted from the dense map and sparse array will contain slots
> necessary for either long running transactions or for long snapshots
> not yet converted to XID based snapshots. I did a few quick
> simulations to measure the dynamics. If we assume a power law
> distribution of transaction lengths and snapshots for the full length
> of transactions with no think time, then 16 slots per backend is
> enough to make the probability of eviction before commit less than
> 1e-6 and being needed at eviction due to a snapshot about 1:1. In
> the real world very long transactions are more likely than the model
> predicts, but at least the normal variation is mostly buffered by this
> size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
> the default value of 100 backends, or 125KB for 1000 backends.

Sounds reasonable to me.

> Sparse buffer needs to be at least big enough to fit CSN slots for the
> xids of all active transactions and non-overflowed subtransactions. At
> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
> or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

> Performance discussion
> --
> 
> Taking a snapshot is extremely cheap in this scheme, I think the cost
> will be mostly for publishing the csnmin and rechecking validity after
> it. Taking snapshots in the shadow of another snapshot (often the case
> for the proposed MVCC catalog access) will be even cheaper as we don't
> have to advertise the new snapshot. The delayed XID based snapshot
> construction should be unlikely, but even if it isn't the costs are
> similar to GetSnapshotData, but we don't need to hold a lock.
> 
> Visibility checking will also be simpler as for the cases where the
> snapshot is covered by the dense array it only requires two memory
> lookups and comparisons.

Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Where as currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)

> At this point I don't see any major issues with this approach. If the
> ensuing discussion doesn't find any major showstoppers then I will
> start to work on a patch bit-by-bit.

Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format.  :-)

> we have a small one in the family.

Congratulations on that one.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: logical changeset generation v3 - comparison to Postgres-R change set format

2013-01-14 Thread Markus Wanner
On 01/13/2013 09:04 PM, Hannu Krosing wrote:
> I'd just start with what send() and recv() on each type produces
> now using GCC on 64bit Intel and move towards adjusting others
> to match. For a period anything else would still be allowed, but
> be "non-standard"

Intel being little endian seems like a bad choice to me, given that
send/recv kind of implies network byte ordering. I'd rather not tie this
to any particular processor architecture at all (at least not solely on
the ground that it's the most common one at the time).

I have no strong opinion on "sameness" of NULLs and could also imagine
that to throw some kind of invalid-operation error, on the grounds that
neither is a value and it's unclear what send() method to use at all.

FWIW, trying to determine the length of a sent NULL gives an interesting
result that I don't currently understand.

> psql (9.2.2)
> Type "help" for help.
> 
> postgres=# SELECT length(int4send(NULL));
>  length 
> --------
>  
> (1 row)
> 
> postgres=# SELECT length(float4send(NULL));
>  length 
> --------
>  
> (1 row)
> 
> postgres=# SELECT length(textsend(NULL));
>  length 
> --------
>  
> (1 row)
> 
> postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
>  length 
> --------
>  
> (1 row)


Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: logical changeset generation v3 - comparison to Postgres-R change set format

2013-01-13 Thread Markus Wanner
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
> On 01/13/2013 10:49 AM, Hannu Krosing wrote:
>> Does this hint that postgreSQL also needs a sameness operator
>> ( "is" or "===" in some languages).
> 
> How do people feel about adding a real sameness operator ?

We'd need to define what "sameness" means. If this goes toward "exact
match in binary representation", this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A "sameness" operator could
then be implemented by simply comparing two values' send() outputs.
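
For example, assuming portable send() formats, equality and sameness
could then diverge like this:

  SELECT '0'::float8 = '-0'::float8;     -- true: equal values
  SELECT float8send('0'::float8)
       = float8send('-0'::float8);       -- false: distinct binary
                                         -- representations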

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-04 Thread Markus Wanner
On 12/03/2012 10:35 PM, Alvaro Herrera wrote:
> So here's version 8.  This fixes a couple of bugs and most notably
> creates a separate PGPROC list for bgworkers, so that they don't
> interfere with client-connected backends.

Nice, thanks. I'll try to review this again, shortly.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
Alvaro,

in general, please keep in mind that there are two aspects that I tend
to mix and confuse: one is what's implemented and working for
Postgres-R. The other this is how I envision (parts of it) to be merged
back into Postgres, itself. Sorry if that causes confusion.

On 12/03/2012 04:43 PM, Alvaro Herrera wrote:
> Oh, I understand that.  My question was more about what mechanism are
> you envisioning for new bgworkers.  What process would be in charge of
> sending such signals?  Would it be a persistent bgworker which would
> decide to start up other bgworkers based on external conditions such as
> timing?  And how would postmaster know which bgworker to start -- would
> it use the bgworker's name to find it in the registered workers list?

Well, in the Postgres-R case, I've extended the autovacuum launcher to a
more general purpose job scheduler, named coordinator. Lacking your
bgworker patch, it is the only process that is able to launch background
workers.

Your work now allows extensions to register background workers. If I
merge the two concepts, I can easily imagine allowing other (bgworker)
processes to launch bgworkers.

Another thing I can imagine is turning that coordinator into something
that can schedule and trigger jobs registered by extensions (or even
user defined ones). Think: cron daemon for SQL jobs in the background.
(After all, that's pretty close to what the autovacuum launcher does,
already.)

> The trouble with the above rough sketch is how does the coordinating
> bgworker communicate with postmaster.

I know. I've gone through that pain.

In Postgres-R, I've solved this with imessages (which was the primary
reason for rejection of the bgworker patches back in 2010).

The postmaster only needs to start *a* background worker process. That
process itself then has to figure out (by checking its imessage queue)
what job it needs to perform. Or if you think in terms of bgworker
types: what type of bgworker it needs to be.

> Autovacuum has a very, um,
> peculiar mechanism to make this work: avlauncher sends a signal (which
> has a hardcoded-in-core signal number) and postmaster knows to start a
> generic avworker; previously avlauncher has set things up in shared
> memory, and the generic avworker knows where to look to morph into the
> specific bgworker required.  So postmaster never really looks at shared
> memory other than the signal number (which is the only use of shmem in
> postmaster that's acceptable, because it's been carefully coded to be
> robust).

In Postgres-R, I've extended exactly that mechanism to work for other
jobs than autovacuum.

> This doesn't work for generic modules because we don't have a
> hardcoded signal number; if two modules wanted to start up generic
> bgworkers, how would postmaster know which worker-main function to call?

You've just described above how this works for autovacuum: the
postmaster doesn't *need* to know. Leave it to the bgworker to determine
what kind of task it needs to perform.

> As posted, the feature is already useful and it'd be good to have it
> committed soon so that others can experiment with whatever sample
> bgworkers they see fit.  That will give us more insight on other API
> refinements we might need.

I completely agree. I didn't ever intend to provide an alternative patch
or hold you back. (Except for the extra daemon issue, where we disagree,
but that's not a reason to keep this feature back). So please, go ahead
and commit this feature (once the issues I've mentioned in the review
are fixed).

Please consider all of these plans or ideas in here (or in Postgres-R)
as extending your work, rather than competing against it.

> I'm going to disappear on paternity leave, most likely later this week,
> or early next week; I would like to commit this patch before that.  When
> I'm back we can discuss other improvements.  That will give us several
> weeks before the end of the 9.3 development period to get these in.  Of
> course, other committers are welcome to improve the code in my absence.

Okay, thanks for sharing that. I'd certainly appreciate your inputs on
further refinements for bgworkers and/or autovacuum.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 04:44 PM, Alvaro Herrera wrote:
> Simon Riggs wrote:
>> Is there anything to be gained *now* from merging those two concepts?
>> I saw that as refactoring that can occur once we are happy it should
>> take place, but isn't necessary.
> 
> IMO it's a net loss in robustness of the autovac implementation.

Agreed.

That's only one aspect of it, though. From the other point of view, it
would be a proof of stability for the bgworker implementation if
autovacuum worked on top of it.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 04:27 PM, Simon Riggs wrote:
> My understanding was that the patch keep autovac workers and
> background workers separate at this point.

I agree to that, for this patch. However, that might not keep me from
trying to (re-)submit some of the bgworker patches in my queue. :-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 03:28 PM, Alvaro Herrera wrote:
> I'm not really that interested in it currently ... and there are enough
> other patches to review.  I would like bgworkers to mature a bit more
> and get some heavy real world testing before we change autovacuum.

Understood.

>> Just like av_launcher does it now: set a flag in shared memory and
>> signal the postmaster (PMSIGNAL_START_AUTOVAC_WORKER).
> 
> I'm not sure how this works.  What process is in charge of setting such
> a flag?

The only process that currently starts background workers ... ehm ...
autovacuum workers is the autovacuum launcher. It uses the above
Postmaster Signal in autovacuum.c:do_start_autovacuum_worker() to have
the postmaster launch bg/autovac workers on demand.

(And yes, this is kind of an exception to the rule that the postmaster
must not rely on shared memory. However, these are just flags we write
atomically, see pmsignal.[ch])
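
For reference, the core of that mechanism is tiny; a simplified sketch
of SendPostmasterSignal() from pmsignal.c (the real code additionally
skips standalone backends):

  void
  SendPostmasterSignal(PMSignalReason reason)
  {
      /* atomically set the per-reason flag in shared memory */
      PMSignalState->PMSignalFlags[reason] = true;
      /* wake the postmaster; it inspects only this flag array */
      kill(PostmasterPid, SIGUSR1);
  }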

I'd like to extend that, so that other processes can request to start
(pre-registered) background workers more dynamically. I'll wait for an
update of your patch, though.

> This is actually a very silly bug: it turns out that guc.c obtains the
> count of workers before workers have actually registered.  So this
> necessitates some reshuffling.

Okay, thanks.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 03:58 PM, Kohei KaiGai wrote:
> It seemed to me you are advocating that any use case of background-
> worker can be implemented with existing separate daemon approach.

That sounds like a misunderstanding. All I'm advocating is that only
3rd-party processes with a real need (like accessing shared memory)
should run under the postmaster.

> What I wanted to say is, we have some cases that background-worker
> framework allows to implement such kind of extensions with more
> reasonable design.

I absolutely agree to that. And I think I can safely claim to be the
first person to publish a patch that provides some kind of background
worker infrastructure for Postgres.

> Yes, one reason I want to use background-worker is access to shared-
> memory segment. Also, it want to connect databases simultaneously
> out of access controls; like as autovacuum.

Yeah, that's the entire reason for background workers. For clarity and
differentiation, I'd add: .. without having a client connection. That's
what makes them *background* workers. (Not to be confused with the
frontend vs backend differentiation. They are background backends, if
you want).

> It is a reason why I'm saying
> SPI interface should be only an option, not only libpq.

I'm extending that to say extensions should better *not* use libpq.
After all, they already have more direct access.

> It also probably means, in case when a process whose duration wants to
> be tied with duration of postmaster, its author can consider to implement
> it as background worker.

I personally don't count that as a real need. There are better tools for
this job; while there clearly are dangers in (ab)using the postmaster to
do it.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 03:16 PM, Kohei KaiGai wrote:
> This feature does not enforce them to implement with this new framework.
> If they can perform as separate daemons, it is fine enough.

I'm not clear on what exactly you envision, but if a process needs
access to shared buffers, it sounds like it should be a bgworker. I
don't quite understand why that process also wants a libpq connection,
but that's certainly doable.

> But it is not all the cases where we want background workers being tied
> with postmaster's duration.

Not wanting a process to be tied to postmaster's duration is a strong
indication that it better not be a bgworker, I think.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
Alvaro,

On 11/30/2012 02:44 PM, Alvaro Herrera wrote:
> So it
> makes easier to have processes that need to run alongside postmaster.

That's where we are in respectful disagreement, then. I don't think
that's easier overall; in my eyes, this looks like a foot gun.

As long as things like pgbouncer, pgqd, etc.. keep to be available as
separate daemons, I'm happy, though.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 01:59 PM, Andres Freund wrote:
> On 2012-11-30 09:57:20 -0300, Alvaro Herrera wrote:
>> One of the uses for bgworkers that don't have shmem connection is to
>> have them use libpq connections instead.  I don't really see the point
>> of forcing everyone to use backend connections when libpq connections
>> are enough.

Requiring a libpq connection is a good indication for *not* wanting the
process to run under the postmaster, IMO.

>> In particular, they are easier to port from existing code;
>> and they make it easier to share code with systems that still have to
>> support older PG versions.
> 
> They also can get away with a lot more crazy stuff without corrupting
> the database.

Exactly. That's a good reason to *not* tie that to the postmaster, then.
Please keep as much of the potentially dangerous stuff separate (and
advise developers to do so as well, instead of offering them a foot
gun), so that our postmaster can do its job, and do it reliably, without
trying to be a general purpose start/stop daemon. There are better and
well established tools for that.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 01:40 PM, Dimitri Fontaine wrote:
> I totally missed the need to connect to shared memory to be able to
> connect to a database and query it. Can't we just link the bgworkder
> code to the client libpq library, just as plproxy is doing I believe?

Well, sure you can use libpq. But so can external processes. So that's
no benefit of running under the postmaster.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 11:09 AM, Dimitri Fontaine wrote:
> Markus Wanner  writes:
>>> However, as you say, maybe we need more coding examples.
>>
>> Maybe a minimally usable extra daemon example? Showing how to avoid
>> common pitfalls? Use cases, anybody? :-)
> 
> What about the PGQ ticker, pgqd?
> 
>   https://github.com/markokr/skytools/tree/master/sql/ticker
>   https://github.com/markokr/skytools/blob/master/sql/ticker/pgqd.c

AFAICS pgqd currently uses libpq, so I think it would rather turn into
what I call a background worker, with a connection to Postgres shared
memory. I perfectly well see use cases (plural!) for those.

What I'm questioning is the use for what I rather call "extra daemons",
that is, additional processes started by the postmaster without a
connection to Postgres shared memory (and thus without a database
connection).

I was asking for a minimal example of such an extra daemon, similar to
worker_spi, showing how to properly write such a thing. Ideally showing
how to avoid common pitfalls.

> Maybe it would be easier to have a version of GNU mcron as an extension,
> with the abitity to fire PostgreSQL stored procedures directly?  (That
> way the cron specific parts of the logic are already implemented)
> 
>   http://www.gnu.org/software/mcron/

Again, that's something that would eventually require a database
connection. Or at least access to Postgres shared memory to eventually
start the required background jobs. (Something Alvaro's patch doesn't
implement, yet. But which exactly matches what the coordinator and
bgworkers in Postgres-R do.)

For ordinary extra daemons, I'm worried about things like an OOM killer
deciding to kill the postmaster, being its parent. Or (io)niceness
settings. Or Linux cgroups. IMO all of these things just get harder to
administrate for extra daemons, when they move under the hat of the
postmaster.

Without access to shared memory, the only thing an extra daemon would
gain by moving under postmaster is the "knowledge" of the start, stop
and restart (crash) events of the database. And that in a very
inflexible way: the extra process is forced to start, stop and restart
together with the database to "learn" about these events.

Using a normal client connection arguably is a better way to learn about
crash events - it doesn't have the above limitation. And the start and
stop events, well, the DBA or sysadmin is under control of these,
already. We can possibly improve on that, yes. But extra daemons are not
the way to go, IMO.

> Another idea would be to have a pgbouncer extension. We would still need
> of course to have pgbouncer as a separate component so that client
> connection can outlive a postmaster crash, but that would still be very
> useful as a first step into admission control. Let's not talk about the
> feedback loop and per-cluster resource usage monitoring yet, but I guess
> that you can see the drift.

Sorry, I don't. Especially not with something like pgbouncer, because I
definitely *want* to control and administrate that separately.

So I guess this is a vote to disallow `worker.bgw_flags = 0`, on the
basis that a process which doesn't need to access Postgres shared
memory is better off doing whatever it does outside of Postgres, for
the benefit of the stability of Postgres and for ease of administration
of the two.

Or, maybe, rather drop the BGWORKER_SHMEM_ACCESS flag and imply that
every registered process wants to have access to Postgres shared memory.
Then document the gotchas and requirements of living under the
Postmaster, as I've requested before. (If you want a foot gun, you can
still write an extension that doesn't need to access Postgres shared
memory, but hey... we can't help those who desperately try to shoot
themselves in the foot).
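
For concreteness, registering a shmem-connected worker under the
proposed API would look roughly like this (field names as per the
patch, so details may still change before commit; example_worker_main
is a made-up entry point):

  /* in an extension's _PG_init() */
  BackgroundWorker worker;

  worker.bgw_name = "example worker";
  worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
  worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
  worker.bgw_restart_time = 60;           /* seconds until restart */
  worker.bgw_main = example_worker_main;  /* made-up worker main function */
  RegisterBackgroundWorker(&worker);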

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-28 Thread Markus Wanner
On 11/28/2012 03:51 PM, Alvaro Herrera wrote:
> I remember your patchset.  I didn't look at it for this round, for no
> particular reason.  I did look at KaiGai's submission from two
> commitfests ago, and also at a patch from Simon which AFAIK was never
> published openly.  Simon's patch merged the autovacuum code to make
> autovac workers behave like bgworkers as handled by his patch, just like
> you suggest.  To me it didn't look all that good; I didn't have the guts
> for that, hence the separation.  If later bgworkers are found to work
> well enough, we can "port" autovac workers to use this framework, but
> for now it seems to me that the conservative thing is to leave autovac
> untouched.

Hm.. interesting to see how the same idea grows into different patches
and gets refined until we end up with something considered committable.

Do you remember anything in particular that didn't look good? Would you
help reviewing a patch on top of bgworker-7 that changed autovacuum into
a bgworker?

> (As an example, we introduced "ilist" some commits back and changed some
> uses to it; but there are still many places where we're using SHM_QUEUE,
> or List, or open-coded lists, which we could port to the new
> infrastructure, but there's no pressing need to do it.)

Well, I usually like cleaning things up earlier rather than later (my
desk clearly being an exception to that rule, though). But yeah, new
code usually brings new bugs with it.

> Sorry about that -- forgot to push to github.  bgworker-7 corresponds to
> commit 0a49a540b which I have just pushed to github.

Thanks.

> Maybe one thing to do in this area would be to ensure that there is a
> certain number of PGPROC elements reserved specifically for bgworkers;
> kind of like autovacuum workers have.  That would avoid having regular
> clients exhausting slots for bgworkers, and vice versa.

Yeah, I think that's mandatory, anyways, see below.

> How are you envisioning that the requests would occur?

Just like av_launcher does it now: set a flag in shared memory and
signal the postmaster (PMSIGNAL_START_AUTOVAC_WORKER).

(That's why I'm so puzzled: it looks like it's pretty much all there,
already. I even remember a discussion about that mechanism probably not
being fast enough to spawn bgworkers. And a proposal to add multiple
such flags, so an avlauncher-like daemon could ask for multiple
bgworkers to be started in parallel. I've even measured the serial
bgworker fork rate back then, IIRC it was in the hundreds of forks per
second...)

>> # Minor technical issues or questions
>>
>> In assign_maxconnections, et al, GetNumRegisteredBackgroundWorkers() is
>> used in relation to MAX_BACKENDS or to calculate MaxBackends. That seems
>> to "leak" the bgworkers that registered with neither
>> BGWORKER_SHMEM_ACCESS nor BGWORKER_BACKEND_DATABASE_CONNECTION set. Or
>> is there any reason to discount such extra daemon processes?
> 
> No, I purposefully let those out, because those don't need a PGPROC.  In
> fact that seems to me to be the whole point of non-shmem-connected
> workers: you can have as many as you like and they won't cause a
> backend-side impact.  You can use such a worker to connect via libpq to
> the server, for example.

Hm.. well, in that case, the shmem-connected ones are *not* counted. If
I create and run an extension that registers 100 shmem-connected
bgworkers, I cannot connect to a system with max_connections=100
anymore, because bgworkers then occupy all of the connections, already.

Please add the registered shmem-connected bgworkers to the
max_connections limit. I think it's counterintuitive to have autovacuum
workers reserved separately, but extension's bgworkers eat (client)
connections. Or put another way: max_connections should always be the
max number of *client* connections the DBA wants to allow.

(Or, if that's in some way complicated, please give the DBA an
additional GUC akin to max_background_workers. That can be merged with
the current max_autovacuum_workers, once/if we rebase autovaccum onto
bgworkers).

>> The additional contrib modules auth_counter and worker_spi are missing
>> from the contrib/Makefile. If that's intentional, they should probably
>> be listed under "Missing".
>>
>> The auth_counter module leaves worker.bgw_restart_time uninitialized.
>>
>> Being an example, it might make sense for auth_counter to provide a
>> signal that just calls SetLatch() to interrupt WaitLatch() and make
>> auth_counter emit its log line upon request, i.e. show how to use SIGUSR1.
> 
> KaiGai proposed that we remove auth_counter.  I don't see why not; I
> mean, worker_spi is already doing most of what auth_counter does.

[HACKERS] Review: Extra Daemons / bgworker

2012-11-28 Thread Markus Wanner
r extra
daemons. I don't want the postmaster to turn into a general purpose
watchdog for everything and the kitchen sink.)


# Tests

No additional automated tests are included - hard to see how that could
be tested automatically, given the lack of a proper test harness.


I hope this review provides useful input.

Regards

Markus Wanner


[1]: That was during commit fest 2010-09. Back then, neither extensions,
latches nor the walsender existed. The patches had a dependency on
imessages, which was not accepted. Other than that, it's a working,
pooled bgworker implementation on top of which autovacuum runs just
fine. It lacks extensibility and the extra daemons feature, though. For
those who missed it:

https://commitfest.postgresql.org/action/patch_view?id=339
http://archives.postgresql.org/message-id/4c3c789c.1040...@bluegap.ch
http://git.postgres-r.org/?p=bgworker;a=summary


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [pgsql-cluster-hackers] Question: Can i cut NON-HOT chain Pointers if there are no concurrent updates?

2012-11-23 Thread Markus Wanner
Henning,

On 11/23/2012 03:17 PM, "Henning Mälzer" wrote:
> Can somebody help me?

Sure, but you might get better answers on the -hackers mailing list. I'm
redirecting there. The cluster-hackers list is pretty low volume and
has few subscribers, I think.

> Question:
> What would be the loss if i cut NON-HOT chain Pointers, meaning i set 
> t_ctid=t_self in the case where it points to a tuple on another page?

READ COMMITTED would stop working correctly in the face of concurrent
transactions, I guess. See the fine manual:

http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-READ-COMMITTED

The problem essentially boils down to: READ COMMITTED transactions need
to learn about tuples *newer* than what their snapshot would see.
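
A minimal illustration of what those forward pointers buy, with a
made-up table t(k, v) and two concurrent READ COMMITTED sessions:

  -- session A                      -- session B
  BEGIN;
  UPDATE t SET v = v + 1
    WHERE k = 1;
                                    UPDATE t SET v = v + 1
                                      WHERE k = 1;  -- blocks on A's row
  COMMIT;
                                    -- B follows t_ctid from the old tuple
                                    -- to A's new version, re-checks the
                                    -- WHERE clause and applies its update
                                    -- there: v ends up incremented twice.

Cut the cross-page pointer and session B can no longer find the
committed successor version.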

> I am working on a project based on "postgres (PostgreSQL) 8.5devel" with the 
> code from several master thesises befor me.

Care to elaborate a bit? Can (part of) that code be released under an
open license?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
Hannu,

On 11/17/2012 03:40 PM, Hannu Krosing wrote:
> On 11/17/2012 03:00 PM, Markus Wanner wrote:
>> On 11/17/2012 02:30 PM, Hannu Krosing wrote:
>>> Is it possible to replicate UPDATEs and DELETEs without a primary key in
>>> PostgreSQL-R
>> No. There must be some way to logically identify the tuple.
> It can be done as selecting on _all_ attributes and updating/deleting
> just the first matching row
> 
> create cursor ...
> select from t ... where t.* = ()
> fetch one ...
> delete where current of ...

That doesn't sound like it could possibly work for Postgres-R. At least
not when there can be multiple rows with all the same attributes, i.e.
without a unique key constraint over all columns.

Otherwise, some nodes could detect two concurrent UPDATES as a conflict,
while other nodes select different rows and don't handle it as a conflict.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
> Is it possible to replicate UPDATEs and DELETEs without a primary key in
> PostgreSQL-R

No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.

Also note that the underlying index is useful for remote application of
change sets (except perhaps for very small tables).

In some cases, for example for n:m linking tables, you need to add a
uniqueness key that spans all columns (as opposed to a simple index on
one of the columns that's usually required, anyway). I hope for
index-only scans eventually mitigating this issue.
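
For example, with made-up names:

  -- n:m linking table: no natural single-column key, so add a
  -- uniqueness key spanning all columns
  CREATE TABLE book_author (
      book_id   int NOT NULL,
      author_id int NOT NULL,
      UNIQUE (book_id, author_id)
  );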

Alternatively, I've been thinking about the ability to add a hidden
column, which can then be used as a PRIMARY KEY without breaking legacy
applications that rely on SELECT * not returning that primary key.

Are there other reasons to want tables without primary keys that I'm
missing?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/16/2012 03:14 PM, Andres Freund wrote:
> Whats the data type of the "COID" in -R?

It's short for CommitOrderId, a 32-bit global transaction identifier
that wraps around, very much like TransactionIds do. (In that sense,
it's global, but unique only for a certain amount of time.)
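
In code, comparing such wrap-around identifiers works just like
TransactionIdPrecedes() does; a sketch with hypothetical Postgres-R
names:

  typedef uint32 CommitOrderId;

  static inline bool
  CoidPrecedes(CommitOrderId a, CommitOrderId b)
  {
      /* modulo-2^32 comparison, valid while the ids are < 2^31 apart */
      return (int32) (a - b) < 0;
  }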

> In the patchset the output plugin has enough data to get the old xid and
> the new xid in the case of updates (not in the case of deletes, but
> thats a small bug and should be fixable with a single line of code), and
> it has enough information to extract the primary key without problems.

It's the xmin of the old tuple that Postgres-R needs to get the COID.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/16/2012 03:05 PM, Andres Freund wrote:
>> I'd like to provide a comparison of the proposed change set format to
>> the one used in Postgres-R.
> 
> Uh, sorry to interrupt you right here, but thats not the "proposed
> format" ;)

Understood. Sorry, I didn't mean to imply that. It's pretty obvious to
me that this is more of a human readable format and that others,
including binary formats, can be implemented. I apologize for the bad
wording of a "proposed format", which doesn't make that clear.

>> The Postgres-R approach is independent of WAL and its format, whereas
>> the approach proposed here clearly is not. Either way, there is a
>> certain overhead - however minimal it is - which the former adds to the
>> transaction processing itself, while the latter postpones it to a
>> separate XLogReader process. If there's any noticeable difference, it
>> might reduce latency in case of asynchronous replication, but can only
>> increase latency in the synchronous case. As far as I understood Andres,
>> it was easier to collect the additional meta data from within the
>> separate process.
> 
> There also is the point that if you do the processing inside heap_* you
> need to make sure the replication targeted data is safely received &
> fsynced away, in "our" case thats not necessary as WAL already provides
> crash safety, so should the replication connection break you can simply
> start from the location last confirmed as being safely sent.

In the case of Postgres-R, the "safely received" part isn't really
handled at the change set level at all. And regarding the fsync
guarantee: you can well use the WAL to provide that, without basing
change set generation on it. In that regard, Postgres-R is probably the
more general approach: you can run Postgres-R with WAL turned off
entirely - which may well make sense if you take into account the vast
amount of cloud resources available, which don't have a BBWC. Instead of
WAL, you can add more nodes at more different locations. And no, you
don't want your database to ever go down, anyway  :-)

>> In summary, I'd say that Postgres-R is an approach specifically
>> targeting and optimized for multi-master replication between Postgres
>> nodes, whereas the proposed patches are kept more general.
> 
> One major aim definitely was optionally be able to replicate into just
> about any target system, so yes, I certainly agree.

I'm glad I got that correct ;-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-16 Thread Markus Wanner
Andres,

On 11/15/2012 01:27 AM, Andres Freund wrote:
> In response to this you will soon find the 14 patches that currently
> implement $subject.

Congratulations on that piece of work.


I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.


Let's start with the representation: I so far considered the Postgres-R
change set format to be an implementation detail and I don't intend it
to be readable by humans or third party tools. It's thus binary only and
doesn't offer a textual representation. The approach presented here
seems to target different formats for different audiences, including
binary representations. More general, less specific.


Next, contents: this proposal is more verbose. In the textual
representation shown, it provides (usually redundant) information about
attribute names and types. Postgres-R doesn't ever transmit attribute
name or type information for INSERT, UPDATE or DELETE operations.
Instead, it relies on attribute numbers and pg_attribute being in a
known consistent state.

Let's compare by example:

> table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 
> text[varchar]:1
> table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 
> text[varchar]:1
> table "replication_example": DELETE (pkey): id[int4]:1

In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:

> table "replication_example": INSERT: VALUES (1, 1, '1')
> table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
> table "replication_example": DELETE: PKEY(1) COID(78)

You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.

It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.


To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods, whereas the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.

To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:

The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the latter postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.


In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.

I hope you found this to be an insightful and fair comparison.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-13 Thread Markus Wanner
On 11/13/2012 01:22 AM, Greg Smith wrote:
> Once you accept that eventually there need to be online conversion
> tools, there needs to be some easy way to distinguish which pages have
> been processed for several potential implementations.

Agreed. What I'm saying is that this identification doesn't need to be
as fine-grained as a per-page bit. A single "horizon" or "border" is
enough, given an ordering of relations (for example by OID) and an
ordering of pages in the relations (obvious).
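
A sketch of the resulting check, with all names made up:

  typedef struct { Oid relid; BlockNumber blkno; } ChecksumHorizon;

  static bool
  PageMustHaveChecksum(Oid relid, BlockNumber blkno)
  {
      ChecksumHorizon h = GetChecksumHorizon();  /* e.g. from pg_control */

      if (relid != h.relid)
          return relid < h.relid;  /* relations converted in OID order */
      return blkno < h.blkno;      /* pages converted in block order */
  }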

> All of the table-based checksum enabling ideas ...

This is not really one - it doesn't allow per-table switching. It's just
meant to be a more compact way of representing which pages have been
checksummed and which not.

> I'm thinking of this in some ways like the way creation of a new (but
> not yet valid) foreign key works.  Once that's active, new activity is
> immediately protected moving forward.  And eventually there's this
> cleanup step needed, one that you can inch forward over a few days.

I understand that. However, I question if users really care. If a
corruption is detected, the clever DBA tells his trainee to immediately
check the file and disk subsystems - no matter whether the corruption
was in old or new data.

You have a point in that pages with "newer" data are often more likely
to be re-read and thus get checked, whereas the checksums written
to pages with old data might not be re-read any time soon. Starting to
write checksums from the end of the relation could mitigate this to some
extent, though.

Also keep in mind the "quietly corrupted after checked once, but still
in the middle of checking a relation" case. Thus a single bit doesn't
really give us the guarantee you ask for. Sure, we can add more than one
bit. And yeah, if done properly, adding more bits exponentially reduces
the likelihood of a corruption inadvertently turning off checksumming
for a page.

All that said, I'm not opposed to using a few bits of the page header. I
wanted to outline an alternative that I think is viable and less intrusive.

> This is why I think any good solution to this problem needs to
> incorporate restartable conversion.

I fully agree to that.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/12/2012 06:52 PM, Jeff Davis wrote:
> OK, so here's my proposal for a first patch (changes from Simon's
> patch):
> 
>   * Add a flag to the postgres executable indicating that it should use
> checksums on everything. This would only be valid if bootstrap mode is
> also specified.
>   * Add a multi-state checksums flag in pg_control, that would have
> three states: OFF, ENABLING, and ON. It would only be set to ON during
> bootstrap, and in this first patch, it would not be possible to set
> ENABLING.
>   * Remove GUC and use this checksums flag everywhere.
>   * Use the TLI field rather than the version field of the page header.
>   * Incorporate page number into checksum calculation (already done).
>   
> Does this satisfy the requirements for a first step? Does it interfere
> with potential future work?

As described before in this thread, I think we might be able to do
without the "has checksum"-bit, as yet another simplification. But I
don't object to adding it, either.

> It's slightly better than that. It's more like: "we can tell you if any
> of your data gets corrupted after transaction X". If old data is
> corrupted before transaction X, then there's nothing we can do. But if
> it's corrupted after transaction X (even if it's old data), the
> checksums should catch it.

I (mis?)read that as Greg referring to the intermediate (enabling)
state, where pages with old data may or may not have a checksum, yet. So
I think it was an argument against staying in that state any longer than
necessary.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 10:44 AM, Craig Ringer wrote:
> That'll make it hard for VACUUM, hint-bit setting, etc to
> opportunistically checksum pages whenever they're doing a page write anyway.

It *is* a hard problem, yes. And the single bit doesn't really solve it.
So I'm arguing against opportunistic checksumming in general. Who
needs that anyway?

> Is it absurd to suggest using another bitmap, like the FSM or visibility
> map, to store information on page checksumming while checksumming is
> enabled but incomplete?

Not absurd. But arguably inefficient, because that bitmap may well
become a bottleneck itself. Plus there's the problem of making sure
those pages are safe against corruptions, so you'd need to checksum the
checksum bitmap... doesn't sound like a nice solution to me.

This has certainly been discussed before.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 05:55 AM, Greg Smith wrote:
> Adding an initdb option to start out with everything checksummed seems
> an uncontroversial good first thing to have available.

+1

So the following discussion really is for a future patch extending on
that initial checkpoint support.

> One of the really common cases I was expecting here is that conversions
> are done by kicking off a slow background VACUUM CHECKSUM job that might
> run in pieces.  I was thinking of an approach like this:
> 
> -Initialize a last_checked_block value for each table
> -Loop:
> --Grab the next block after the last checked one
> --When on the last block of the relation, grab an exclusive lock to
> protect against race conditions with extension
> --If it's marked as checksummed and the checksum matches, skip it
> ---Otherwise, add a checksum and write it out
> --When that succeeds, update last_checked_block
> --If that was the last block, save some state saying the whole table is
> checkedsummed

Perfect, thanks. That's the rough idea I had in mind as well, written
out in detail and catching the extension case.
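
In rough code, the loop could look like this (names are made up;
WAL-logging, error handling and the end-of-relation extension lock are
elided):

  while (last_checked_block + 1 < RelationGetNumberOfBlocks(rel))
  {
      BlockNumber blkno = ++last_checked_block;
      Buffer      buf = ReadBuffer(rel, blkno);

      LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
      if (!PageHasValidChecksum(buf))   /* made-up check */
          MarkBufferDirty(buf);         /* gets rewritten with a checksum */
      UnlockReleaseBuffer(buf);

      SaveConversionProgress(rel, blkno);  /* made-up, persists restart point */
  }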

> With that logic, there is at least a forward moving pointer that removes
> the uncertainty around whether pages have been updated or not.  It will
> keep going usefully if interrupted too.  One obvious this way this can
> fail is if:
> 
> 1) A late page in the relation is updated and a checksummed page written
> 2) The page is corrupted such that the "is this checksummed?" bits are
> not consistent anymore, along with other damage to it
> 3) The conversion process gets to this page eventually
> 4) The corruption of (2) isn't detected

IMO this just outlines how limited the use of the "is this checksummed"
bit in the page itself is. It just doesn't catch all cases. Is it worth
having that bit at all, given your block-wise approach above?

It really only serves to catch corruptions to *newly* dirtied pages
*during* the migration phase that *keep* that single bit set. Everything
else is covered by the last_checked_block variable. Sounds narrow enough
to be negligible. Then again, it's just a single bit per page...

> The only guarantee I see that we can give for online upgrades is that
> after a VACUUM CHECKSUM sweep is done, and every page is known to both
> have a valid checksum on it and have its checksum bits set, *then* any
> page that doesn't have both set bits and a matching checksum is garbage.

From that point in time on, we'd theoretically better use that bit as an
additional checksum bit rather than requiring it to be set at all times.
Really just theoretically, I'm certainly not advocating a 33 bit
checksum :-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/10/2012 12:08 AM, Jeff Davis wrote:
> The bit indicating that a checksum is present may be lost due to
> corruption.

Hm.. I see.

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't even want a per-database switch.

Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?

That would mean having to re-calculate the checksums on pages that got
dirtied before VACUUM came along to migrate them to having a checksum,
but that seems acceptable. VACUUM could even detect that case and
wouldn't have to re-write it with the same contents.

I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 07:53 PM, Jeff Davis wrote:
> One problem is telling which pages are protected and which aren't. We
> can have a couple bits in the header indicating that a checksum is
> present, but it's a little disappointing to have only a few bits
> protecting a 16-bit checksum.

Given your description of option 2 I was under the impression that each
page already has a bit indicating whether or not the page is protected
by a checksum. Why do you need more bits than that?

> Also, I think that people will want to have a way to protect their old
> data somehow.

Well, given that specific set of users is not willing to go through a
rewrite of each and every page of its database, it's hard to see how we
can protect their old data better.

However, we certainly need to provide the option to go through the
rewrite for other users, who are well willing to bite that bullet.

From a user's perspective, the trade-off seems to be: if you want your
old data to be covered by checksums, you need to go through such an
expensive VACUUM run that touches every page in your database.

If you don't want to or cannot do that, you can still turn on
checksumming for newly written pages. You won't get full protection and
it's hard to tell what data is protected and what not, but it's still
better than no checksumming at all. Especially for huge databases, that
might be a reasonable compromise.

One could even argue that this just leads to a prolonged migration and
with time, the remaining VACUUM step becomes less and less frightening.

Do you see any real foot-guns or other show-stoppers for permanently
allowing that in-between-state?

Or do we have other viable options that prolong the migration and thus
spread the load better over time?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 06:18 AM, Jesper Krogh wrote:
> I would definately stuff our system in state = 2 in your
> description if it was available.

Hm.. that's an interesting statement.

What's probably worst when switching from OFF to ON is the VACUUM run
that needs to touch every page (provided you haven't ever turned
checksumming on before). Maybe you want to save that step and still get
the additional safety for newly dirtied pages, right?

A use case worth supporting?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
Jeff,

On 11/09/2012 02:01 AM, Jeff Davis wrote:
> For the sake of simplicity (implementation as well as usability), it
> seems like there is agreement that checksums should be enabled or
> disabled for the entire instance, not per-table.

Agreed. I briefly thought about making it a per-database setting, but
what about shared system catalogs? Let's keep it simple and have a
single per-cluster switch for now.

> I don't think a GUC entirely makes sense (in its current form, anyway).
> We basically care about 3 states:
>   1. Off: checksums are not written, nor are they verified. Pages that
> are newly dirtied have the checksum information in the header cleared.
>   2. Enabling: checksums are written for every dirty page, but only
> verified for pages where the checksum is present (as determined by
> information in the page header).
>   3. On: checksums are written for every dirty page, and verified for
> every page that's read. If a page does not have a checksum, it's
> corrupt.

Sounds sane, yes.

> And the next question is what commands to add to change state. Ideas:
> 
>CHECKSUMS ENABLE; -- set state to "Enabling"
>CHECKSUMS DISABLE; -- set state to "Off"

Yet another SQL command doesn't feel like the right thing for such a
switch. Quick googling revealed that CHECKSUM is a system function in MS
SQL and MySQL has a CHECKSUM TABLE command. And you never know what
the committee is coming up with next.

Apart from that, I'd like something more descriptive than just
"checksums". Block checksums? Heap checksums? Data checksums?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-30 Thread Markus Wanner
On 10/29/2012 12:05 PM, Robert Haas wrote:
> OK, I'll stop babbling now...

Not perceived as babbling here. Thanks for that nice round-up of options
and ideas around the torn page problem.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Simon,

On 10/17/2012 10:34 AM, Simon Riggs wrote:
> IMHO an API is required for "give me the next allocation of numbers",
> essentially a bulk equivalent of nextval().

Agreed. That pretty exactly matches what I described (and what's
implemented in Postgres-R). The API then only needs to be called every N
invocations of nextval(), because otherwise nextval() can simply return
a cached number previously allocated in a single step, eliminating a lot
of the communication overhead.

You realize an API at that level doesn't allow for an implementation of
options 1 and 2? (Which I'm convinced we don't need, so that's fine with
me).

> Anything lower level is going to depend upon implementation details
> that I don't think we should expose.

Exactly. Just like we shouldn't expose other implementation details,
like writing to system catalogs or WAL.

> I'm sure there will be much commonality between 2 similar
> implementations, just as there is similar code in each index type. But
> maintaining modularity is important and ahead of us actually seeing 2
> implementations, trying to prejudge that is going to slow us all down
> and likely screw us up.

Agreed. Let me add that modularity only serves a purpose if the
boundaries between the modules are chosen wisely. It sounds like we are
on the same page, though.

To illustrate this: IMHO an API for setval() is required to invalidate
all nodes' caches and re-set an initial value, as a starting point for the
next bulk of numbers that nextval() will return.

currval() doesn't need to be changed or "hooked" at all, because it's a
read-only operation.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Tom,

On 10/16/2012 06:15 PM, Tom Lane wrote:
> I challenge you to find anything in the SQL standard that suggests that
> sequences have any nonlocal behavior.  If anything, what you propose
> violates the standard, it doesn't make us follow it more closely.

If you look at a distributed database as a transparent equivalent of a
single-node system, I'd say the SQL standard applies to the entire
distributed system. From that point of view, I'd rather argue that any
"local-only" behavior violates the standard.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Simon,

On 10/16/2012 02:36 PM, Simon Riggs wrote:
> Where else would you put the hook? The hook's location as described
> won't change whether you decide you want 1, 2 or 3.

You assume we want an API that supports all three options. In that case,
yes, the hooks need to be very general.

Given that option 3 got by far the most support, I question whether we
need such a highly general API. I envision an API that keeps the
bookkeeping and cache lookup functionality within Postgres. So we have a
single, combined-effort, known working implementation for that.

What remains to be done within the plugin effectively is the consensus
problem: it all boils down to the question of which node gets the next
chunk of N sequence numbers. Where N can be 1 (default CACHE setting in
Postgres) or any higher number for better performance (reduces the total
communication overhead by a factor of N - or at least pretty close to
that, if you take into account "lost" chunks due to node failures).

A plugin providing that has to offer a method to request a global
ordering and would have to trigger a callback upon reaching consensus
with other nodes on who gets the next chunk of sequence numbers. That
works for all N >= 1. And properly implements option 3 (but doesn't
allow implementations of options 1 or 2, which I claim we don't need,
anyway).
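
In pseudo-C, the split between common code and plugin could look like
this (all names made up; AllocateNextChunk() stands in for whatever the
plugin's consensus round provides):

  int64
  nextval_global(SeqCache *cache)
  {
      if (cache->next > cache->last)  /* local chunk used up */
      {
          int64 first = AllocateNextChunk(cache->seqoid, N);

          cache->next = first;
          cache->last = first + N - 1;
      }
      return cache->next++;
  }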

> Implementations will be similar, differing mostly in the topology and
> transport layer

I understand that different users have different needs WRT transport
layers - moving the hooks as outlined above still allows flexibility in
that regard.

What different topologies do you have in mind? I'd broadly categorize
this all as multi-master. Do you need finer grained differentiation? Or
do you somehow include slaves (i.e. read-only transactions) in this process?

As you yourself are saying, implementations will only differ in that
way, so let's keep the common code the same and not require plugins to
duplicate it. (This also allows us to use the system catalogs for book
keeping, as another benefit).

> which means its not going to be possible to provide
> such a thing initially without slowing it down to the point we don't
> actually get it at all.

Sorry, I don't quite understand what you are trying to say here.

Overall, thanks for bringing this up. I'm glad to see something
happening in this area, after all.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Markus Wanner
On 10/16/2012 12:47 AM, Josh Berkus wrote:
> I'd also love to hear from the PostgresXC folks on whether this solution
> works for them.  Postgres-R too.

In Postgres-R, option 3) is implemented, though by default sequences
work just like on a single machine, giving you monotonically increasing
sequence values - independent of the node you call nextval() from. IMO
that's the user's expectation. (And yes, this has a performance penalty.
But no, there's no compromise in availability).

It is implemented very much like the per-backend cache we already have
in vanilla Postgres, but taken to the per-node level. This gives the
user a nice compromise between strongly ordered and entirely random
values, allowing fine-tuning of the trade-off between performance and
laziness in the ordering (think of CACHE 10 vs. CACHE 1).
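
The single-node analogue of that trade-off:

  CREATE SEQUENCE seq_strict CACHE 1;   -- strictly ordered, more contention
  CREATE SEQUENCE seq_lazy   CACHE 10;  -- each backend pre-allocates 10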

> If it works for all three of those
> tools, it's liable to work for any potential new tool.

In Postgres-R, this per-node cache uses additional attributes in the
pg_sequence system catalog to store its state. This is
something I'm sure is not feasible to do from within a plugin.

Why does a "Global Sequences" API necessarily hook at the nextval() and
setval() level? That sounds like it yields an awkward amount of
duplicate work. Reading this thread, so far it looks like we agree that
option 3) is the most feasible optimization (the strict ordering being
the un-optimized starting point). Do we really need an API that allows
for implementations of options 1) and 2)?

What I'd appreciate more is a common implementation for option 3) with
an API to plug in different solutions to the underlying consensus problem.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
On 10/11/2012 04:06 PM, Tom Lane wrote:
> Yeah, if we decide to stick with the limitation, some documentation
> would be called for.  I remember having run into this and having removed
> functionality from an rm_desc function rather than question the premise.
> But maybe the extra functionality is worth the cycles.

Well, I've been interested in getting it correct, and didn't care about
performance.

But I can certainly imagine someone enabling it on a production system
to get more debug info. In which case performance would matter more.
However, WAL_DEBUG being a #define, I bet only very few admins do that.
So I tend towards sacrificing performance for better debug info in the
WAL_DEBUG case. (Especially given that you can still disable it via GUC).

Regards

Markus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
Tom,

On 10/11/2012 03:11 PM, Tom Lane wrote:
> The original design intention was that rm_desc should not attempt to
> print any such data, but obviously some folks didn't get the word.

FWIW: in case the source code contains comments explaining that
intention, I certainly missed them so far.

Regards

Markus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
Hi,

I stumbled across a minor issue in xlog.c:1030: the WAL_DEBUG code block
there passes rdata->data to the rm_desc() method. However, that's only
the first XLogRecData struct, not the entire XLog record. So the
rm_desc() method effectively reports spurious data for any subsequent part.

Take a commit record with subxacts as an example: during XLogInsert,
Postgres reports the following:

LOG:  INSERT @ 0/16F3270: prev 0/16F3234; xid 688; len 16: Transaction -
commit: 2012-10-11 09:31:17.790368-07; subxacts: 3214563816

Note that the xid in subxacts is way off. During recovery from WAL, the
record is logged correctly:

LOG:  REDO @ 0/16F3270; LSN 0/16F329C: prev 0/16F3234; xid 688; len 16:
Transaction - commit: 2012-10-11 09:31:17.790368-07; subxacts: 689

Attached is a possible fix.

Regards

Markus Wanner
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1033,1039  begin:;
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
! 		StringInfoData buf;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
--- 1033,1056 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
! 		StringInfoData	buf;
! 		char		   *full_rec = palloc(write_len), *ins_ptr;
! 
! 		/*
! 		 * We need to assemble the entire record once, to be able to dump it out
! 		 * properly.
! 		 */
! 		rdt = rdata;
! 		ins_ptr = full_rec;
! 		while (rdt)
! 		{
! 			if (rdt->data != NULL)
! 			{
! memcpy(ins_ptr, rdt->data, rdt->len);
! ins_ptr += rdt->len;
! 			}
! 			rdt = rdt->next;
! 		}
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
***
*** 1042,1051  begin:;
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rechdr->xl_rmid].rm_desc(&buf, rechdr->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
--- 1059,1069 
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rechdr->xl_rmid].rm_desc(&buf, rechdr->xl_info, full_rec);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
+ 		pfree(full_rec);
  	}
  #endif
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-13 Thread Markus Wanner
On 07/13/2012 11:33 AM, Markus Wanner wrote:
> As another minor improvement, it doesn't seem necessary to repeatedly
> set the rootBlkno.

Sorry, my mail program delivered an older version of the patch, which
didn't reflect that change. Here's what I intended to send.

Regards

Markus Wanner
#
# Minor cleanup of ginInsertValue with additional asserts and
# a tighter loop for finding the root in the GinBtreeStack.
#

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	f6dc88ae5716275e62a6f6715aa7204abd430089
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,294 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
  
! 	/* remember root BlockNumber */
! 	while (parent)
! 	{
! 		rootBlkno = parent->blkno;
  		parent = parent->parent;
- 	}
  
! 	while (stack)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
--- 276,296 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno;
  	Page		page,
  rpage,
  lpage;
  
! 	/* extract root BlockNumber from stack */
! 	Assert(stack != NULL);
! 	parent = stack;
! 	while (parent->parent)
  		parent = parent->parent;
  
! 	rootBlkno = parent->blkno;
! 	Assert(BlockNumberIsValid(rootBlkno));
! 
! 	for (;;)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 469,474 
--- 471,477 
  
  		UnlockReleaseBuffer(stack->buffer);
  		pfree(stack);
+ 		Assert(parent != NULL);		/* parent == NULL case is handled above */
  		stack = parent;
  	}
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-13 Thread Markus Wanner
On 07/11/2012 05:45 AM, Tom Lane wrote:
> I'm also inclined to think that the "while (stack)" coding of the rest
> of it is wrong, misleading, or both, on precisely the same grounds: if
> that loop ever did fall out at the test, the function would have failed
> to honor its contract.  The only correct exit points are the "return"s
> in the middle.

I came to the same conclusion, yes. Looks like the additional asserts in
the attached patch all hold true.

As another minor improvement, it doesn't seem necessary to repeatedly
set the rootBlkno.

Regards

Markus Wanner
#
# old_revision [d90761edf47c6543d4686a76baa0b4e2a7ed113b]
#
# patch "src/backend/access/gin/ginbtree.c"
#  from [2d3e63387737b4034fc25ca3cb128d9ac57f4f01]
#to [62efe71b8e3429fc1bf2d2742f8d2fa69f2f4de6]
#

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	62efe71b8e3429fc1bf2d2742f8d2fa69f2f4de6
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,294 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
  
  	/* remember root BlockNumber */
! 	while (parent)
  	{
  		rootBlkno = parent->blkno;
  		parent = parent->parent;
! 	}
  
! 	while (stack)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
--- 276,298 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno;
  	Page		page,
  rpage,
  lpage;
  
  	/* remember root BlockNumber */
! 	Assert(stack != NULL);
! 	parent = stack;
! 	do
  	{
  		rootBlkno = parent->blkno;
  		parent = parent->parent;
! 	} while (parent);
  
! 	Assert(BlockNumberIsValid(rootBlkno));
! 
! 	for (;;)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 469,474 
--- 473,479 
  
  		UnlockReleaseBuffer(stack->buffer);
  		pfree(stack);
+ 		Assert(parent != NULL);		/* parent == NULL case is handled above */
  		stack = parent;
  	}
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-10 Thread Markus Wanner
Hackers,

I stumbled across an initialization of a BlockNumber with InvalidBuffer,
which seems strange to me, as the values for "invalid" of the two types
are different, see attached patch.

In case the 'stack' argument passed to that function is not NULL, the
variable in question gets overridden immediately, in which case it
certainly doesn't matter. I don't know nor did I check whether or not it
can ever be NULL. So this might not be a real issue at all.
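
For reference, the two "invalid" values, from src/include/storage/buf.h
and src/include/storage/block.h, respectively:

#define InvalidBuffer	0

#define InvalidBlockNumber		((BlockNumber) 0xFFFFFFFF)

#define BlockNumberIsValid(blockNumber) \
	((bool) ((BlockNumber) (blockNumber) != InvalidBlockNumber))

So a BlockNumber initialized with InvalidBuffer is block zero, which
BlockNumberIsValid() happily accepts.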

Regards

Markus Wanner
# InvalidBlockNumber is -1 (or rather 0xFFFFFFFF), while
# the currently used InvalidBuffer is 0, which is a valid
# BlockNumber.

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	67351e1b6541b25ab3c8e8dc7a57487c2422e124
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,282 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
--- 276,282 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBlockNumber;
  	Page		page,
  rpage,
  lpage;


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pruning the TODO list

2012-06-30 Thread Markus Wanner
Hi,

On 06/22/2012 05:38 PM, Andrew Dunstan wrote:
> I think the real problem with the TODO list is that some people see it
> as some sort of official roadmap, and it really isn't. Neither is there
> anything else that is.

To me, it looks like TODO is just a misnomer. The file should be named
TODISCUSS, IDEAS, or something. But the current file name implies consensus.

Wouldn't renaming solve that kind of misunderstanding? (..in the vein of
"address(ing) real problems in the simplest way"..)

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Tom,

On 08/25/2011 04:59 PM, Tom Lane wrote:
> That's a good point.  If the ring buffer size creates a constraint on
> the maximum number of sub-XIDs per transaction, you're going to need a
> fallback path of some sort.

I think Robert envisions the same fallback path we already have:
subxids.overflowed.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 04:48 PM, Robert Haas wrote:
> What's a typical message size for imessages?

Most message types in Postgres-R are just a couple bytes in size.
Others, especially change sets, can be up to 8k.

However, I think you'll have an easier job guaranteeing that backends
"consume" their portions of the ring-buffer in time.  Plus wrap-around
isn't that much of a problem in your case.  (I couldn't drop imessages,
but had to let senders wait).

> Well, one long-running transaction that only has a single XID is not
> really a problem: the snapshot is still small.  But one very old
> transaction that also happens to have a large number of
> subtransactions all of which have XIDs assigned might be a good way to
> stress the system.

Ah, right, that's why it's a list of transactions in progress and not a
list of completed transactions in SnapshotData... good.

> Each reader decides which data he needs to copy from the buffer, and
> then copies it, and then checks whether any of it got overwritten
> before the copy was completed.  So there's a lively possibility that
> the snapshot that was current when the reader began copying it will no
> longer be current by the time he finishes copying it, because a commit
> has intervened.  That's OK: it just means that, effectively, the
> snapshot is taken at the moment the start and stop pointers are read,
> and won't take into account any commits that happen later, which is
> exactly what a snapshot is supposed to do anyway.

Agreed, that makes sense.  Thanks for explaining.
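
If I try to put that reading scheme into code, I arrive at something
like the following - a minimal sketch of my understanding, all names
made up, memory barriers omitted:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 65536			/* made-up size */

typedef struct
{
	volatile uint64 head;		/* total bytes ever written; only
								 * committers advance this */
	char		data[RING_SIZE];
} SnapRing;

/*
 * Copy the snapshot at logical positions [start, start + len) into
 * dst, then check that no writer lapped us in the meantime.  On
 * false, the caller simply retries with fresh start/stop pointers.
 */
static bool
read_snapshot(SnapRing *ring, char *dst, uint64 start, size_t len)
{
	size_t		off = start % RING_SIZE;
	size_t		first = (RING_SIZE - off < len) ? RING_SIZE - off : len;

	memcpy(dst, ring->data + off, first);
	memcpy(dst + first, ring->data, len - first);	/* wrapped part, if any */

	/* position p is clobbered once head exceeds p + RING_SIZE */
	return ring->head <= start + RING_SIZE;
}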

> There is a hopefully quite small possibility that by the time the
> reader finishes copying it so much new data will have been written to
> the buffer that it will have wrapped around and clobbered the portion
> the reader was interested in.  That needs to be rare.

Yeah.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 03:24 PM, Robert Haas wrote:
> My hope (and it might turn out that I'm an optimist) is that even with
> a reasonably small buffer it will be very rare for a backend to
> experience a wraparound condition.

It certainly seems less likely than with the ring-buffer for imessages, yes.

Note, however, that for imessages, I've also had the policy in place
that a backend *must* consume its message before sending any.  And that
I took great care for all receivers to consume their messages as early
as possible.  Nonetheless, I kept incrementing the buffer size (to
multiple megabytes) to make this work.  Maybe I'm overcautious because
of that experience.

> - a handful of XIDs at most - because, on the average, transactions
> are going to commit in *approximately* increasing XID order

This assumption quickly turns false, if you happen to have just one
long-running transaction, I think.  Or in general, if transaction
duration varies a lot.

> So the backend taking a snapshot only needs
> to be able to copy < ~64 bytes of information from the ring buffer
> before other backends write ~27k of data into that buffer, likely
> requiring hundreds of other commits.

You said earlier, that "only the latest snapshot" is required.  It takes
only a single commit for such a snapshot to not be the latest anymore.

Instead, if you keep around older snapshots for some time - as what your
description here implies - readers are free to copy from those older
snapshots while other backends are able to make progress concurrently
(writers or readers of other snapshots).

However, that either requires keeping track of readers of a certain
snapshot (reference counting) or - as I understand your description -
you simply invalidate all concurrent readers upon wrap-around, or something.

> That seems vanishingly unlikely;

Agreed.

> Now, as the size of the snapshot gets bigger, things will eventually
> become less good.

Also keep configurations with increased max_connections in mind.  With
that, not only do the snapshots get bigger, but more processes have to
share CPU time, on average making memcpy slower for a single process.

> Of course even if wraparound turns out not to be a problem there are
> other things that could scuttle this whole approach, but I think the
> idea has enough potential to be worth testing.  If the whole thing
> crashes and burns I hope I'll at least learn enough along the way to
> design something better...

That's always a good motivation.  In that sense: happy hacking!

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert,

On 08/25/2011 04:59 AM, Robert Haas wrote:
> True; although there are some other complications.  With a
> sufficiently sophisticated allocator you can avoid mutex contention
> when allocating chunks, but then you have to store a pointer to the
> chunk somewhere or other, and that then requires some kind of
> synchronization.

Hm.. right.

> One difference with snapshots is that only the latest snapshot is of
> any interest.

Theoretically, yes.  But as far as I understood, you proposed the
backends copy that snapshot to local memory.  And copying takes some
amount of time, possibly being interrupted by other backends which add
newer snapshots...  Or do you envision the copying to restart whenever a
new snapshot arrives?

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert, Jim,

thanks for thinking out loud about dynamic allocation of shared memory.
 Very much appreciated.

On 08/23/2011 01:22 AM, Robert Haas wrote:
> With respect to a general-purpose shared memory allocator, I think
> that there are cases where that would be useful to have, but I don't
> think there are as many of them as many people seem to think.  I
> wouldn't choose to implement this using a general-purpose allocator
> even if we had it, both because it's undesirable to allow this or any
> subsystem to consume an arbitrary amount of memory (nor can it fail...
> especially in the abort path) and because a ring buffer is almost
> certainly faster than a general-purpose allocator.

I'm in respectful disagreement regarding the ring-buffer approach and
think that dynamic allocation can actually be more efficient if done
properly, because there doesn't need to be head and tail pointers, which
might turn into a point of contention.

As a side note: I've been there with imessages.  Those were first
organized as a ring-buffer.  The major problem with that approach was
that imessages were consumed with varying delay.  If an imessage was
left there for a longer amount of time, it blocked creation of new
imessages, because the ring-buffer cycled around once and its head
arrived back at the unconsumed imessage.

IIUC (which might not be the case) the same issue applies for snapshots.
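
In code, the effect looked roughly like this (simplified from memory,
names made up, all locking omitted):

#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8192

typedef struct
{
	size_t		head;			/* logical write position */
	size_t		tail;			/* start of oldest unconsumed message */
	char		data[RING_SIZE];
} MsgRing;

static bool
append_message(MsgRing *ring, const char *msg, size_t len)
{
	size_t		i;

	/*
	 * A single message parked near tail is enough: head cycles around
	 * and arrives back at the unconsumed message, and from then on
	 * every sender fails here until that message gets consumed.
	 */
	if (ring->head + len - ring->tail > RING_SIZE)
		return false;			/* sender has to wait */

	for (i = 0; i < len; i++)
		ring->data[(ring->head + i) % RING_SIZE] = msg[i];
	ring->head += len;
	return true;
}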

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Hello Dimitri,

On 08/23/2011 06:39 PM, Dimitri Fontaine wrote:
> I'm far from familiar with the detailed concepts here, but allow me to
> comment.  I have two open questions:
> 
>  - is it possible to use a distributed algorithm to produce XIDs,
>something like Vector Clocks?
> 
>Then each backend is able to create a snapshot (well, and XID) on its
>own, and any backend is still able to compare its snapshot to any
>other snapshot (well, XID)

Creation of snapshots and XID assignment are not as related as you imply
here.  Keep in mind that a read-only transaction has a snapshot, but no
XID.  (Not sure if it's possible for a transaction to have an XID, but
no snapshot.  If it only touches system catalogs with SnapshotNow,
maybe?  Don't think we support that, ATM).

>  - is it possible to cache the production of the next snapshots so that
>generating an XID only means getting the next in a pre-computed
>vector?

The way I look at it, what Robert proposed can be thought of as "cache
the production of the next snapshot", with a bit of a stretch of what a
cache is, perhaps.  I'd rather call it "early snapshot creation", maybe
"look-ahead something".

ATM backends all scan ProcArray to generate their snapshot.  Instead,
what Robert proposes would - sometimes, somewhat - move that work from
snapshot creation time to commit time.

As Tom points out, the difficulty lies in the question of when it's
worth doing that:  if you have lots of commits in a row, and no
transaction ever uses the (pre-generated) snapshots of the point in time
in between, then those were wasted.  OTOH, if there are just very few
COMMITs spread across lots of writes, the read-only backends will
re-create the same snapshots, over and over again.  Seems wasteful as
well (as GetSnapshotData popping up high on profiles confirms somewhat).

Hope to have cleared up things a bit.

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX question

2011-06-26 Thread Markus Wanner
Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:
> When playing with mmap I worked out an approach for dealing with growing
> files, so...

Your approach seems to require SysV shmem (for nattach) as well as
POSIX shmem and/or mmap.  Adding requirements for these syscalls
certainly needs to give a good benefit for Postgres, as they presumably
pose portability issues.

> 3. a. Lock header when adding chunks (1st chunk is header) (we don't
> want concurrent chunk allocation)

Sure we don't?  There are at least a dozen memory allocators for
multi-threaded applications, all trying to optimize for concurrency.
The programmer of a multi-threaded application doesn't need to care much
about concurrent allocations.  He can allocate (and free) quite a lot of
tiny chunks concurrently from shared memory.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exporting closePGconn from libpq

2011-05-16 Thread Markus Wanner
Leon,

On 05/14/2011 05:23 PM, Leon Smith wrote:
> A minor issue has come up in creating low-level bindings to libpq for
> safe garbage-collected languages,  namely that PQfinish is the only
> (AFAICT) way to close a connection but also de-allocates the memory
> used to represent the database connection.  It would be preferable
> to call PQfinish to free the memory in a finalizer,  but applications
> need a way to disconnect from the database at a predictable and
> deterministic point in time,  whereas leaving a bit of memory around
> until the GC finally gets to it is relatively harmless.

It's harmless, but I think it's also useless.  Or why do you want to
keep that (libpq-private) bit of memory around beyond PQfinish()?

I'm not sure what language or VM you have in mind, but your description
sounds like you are writing a wrapper by definition.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "stored procedures" - use cases?

2011-05-12 Thread Markus Wanner
Hi,

On 05/10/2011 02:55 PM, Robert Haas wrote:
> On Mon, May 9, 2011 at 11:58 PM, Pavel Stehule  
> wrote:
>> no - you are little bit confused :). CALL and function execution
>> shares nothing. There is significant differences between function and
>> procedure. Function is called only from executor - from some plan, and
>> you have to know a structure of result before run. The execution of
>> CALL is much simple - you just execute code - without plan and waiting
>> for any result - if there is.

I think the distinction between function and procedure is misleading
here.  Some envision stored *procedures* to be able to return values,
result sets and possibly even *multiple* result sets.

> The main features seem to be (1) explicit transaction control
> and/or execution of commands like VACUUM that can't be invoked from
> within a transaction,

I think that's the main point of stored procedures.

> (2) autonomous transactions

To me autonomous transactions seem orthogonal.  Those can be used to
implement (1) above, but might have other uses for regular transactions
as well.

(The point I'm taking home here is that you might want to control not
only one concurrent transaction, but several from a "stored procedure".
 So far, I assumed only one.)

> and (3) returning
> multiple result sets.  But I don't think anybody would be desperately
> unhappy if it magically became possible to do those things from
> regular functions, unlikely as that may seem.

That point definitely is on my wish-list for UDFs already.  I didn't
think of this as having to do with stored procedures, either.

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "full_page_writes" makes no difference?

2011-05-04 Thread Markus Wanner
Hi,

On 05/04/2011 03:46 AM, Tian Luo wrote:
> No matter I turn on or turn off the "full_page_writes", I always
> observe 8192-byte writes of log data for simple write operations
> (write/update).

How did you measure that?  A single transaction doing a single write, I
guess.  Ever tried multiple transactions with a simple write operation
each and checking how much WAL that spits out per transaction?

As I understand it, dirty blocks are written to disk as soon as
feasible.  After all, that helps crash recovery.  With a basically idle
system, "as soon as feasible" might be pretty soon.  However, put your
(disk sub-) system under load and "as soon as feasible" might take a while.

> But according to the document, when this is off, it could speed up
> operations but may cause problems during recovery. So, I guess this is
> because it writes less when the option is turned off. However, this
> contradicts my observations 

I think you didn't trigger the savings.  It's about writing full pages
on the first write to a block after a checkpoint.  Did you monitor
checkpoint times of Postgres in your tests?
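
Simplified, the decision XLogInsert makes per touched buffer boils
down to something like the following sketch (the real XLogRecPtr is a
two-field struct compared via XLByteLE; here it's flattened to a plain
integer):

#include <stdbool.h>
#include <stdint.h>

/*
 * A full-page image gets attached only on the first modification of a
 * page after the latest checkpoint's redo pointer.  Updating the same
 * page a thousand times between two checkpoints pays the full-page
 * price just once - so a test that doesn't cross a checkpoint won't
 * show much of a difference either way.
 */
static bool
needs_backup_block(uint64 page_lsn, uint64 redo_ptr, bool full_page_writes)
{
	return full_page_writes && page_lsn <= redo_ptr;
}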

> If I am not missing anything, I find that the writes of log data go
> through function "XLogWrite" in source file
> "backend/access/transam/xlog.c".
> 
> In this file, log data are written with the following code:
> 
> from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
> nbytes = npages * (Size) XLOG_BLCKSZ;
> if (write(openLogFile, from, nbytes) != nbytes)
> {
>  ...
> }
> 
> So, "nbytes" should always be multiples of XLOG_BLCKSZ, which in the
> default case, is 8192.

That observation seems correct.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - asynchronous functions

2011-04-27 Thread Markus Wanner
On 04/26/2011 11:17 PM, Robert Haas wrote:
> IIRC, we kind of got stuck on the prerequisite wamalloc patch, and that sunk 
> the whole thing.  :-(

Right, that prerequisite was the largest stumbling block.  As I
certainly mentioned back then, it should be possible to get rid of the
imessages dependency (and thus wamalloc).  So whoever really wants to
implement asynchronous functions (or autonomous transactions) is more
than welcome to try that.

Please keep in mind that you'd need an alternative communication path.
Not only for the bgworker infrastructure itself, but for communication
between the requesting backend and the bgworker (except for
fire-and-forget jobs like autovacuum, of course.  OTOH even those could
benefit from communicating back their state to the coordinator.. eh..
autovacuum launcher).

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - asynchronous functions

2011-04-26 Thread Markus Wanner
Robert,

On 04/26/2011 02:25 PM, Robert Haas wrote:
> We've talked about a number of features that could benefit from some
> kind of "worker process" facility (e.g. logical replication, parallel
> query).  So far no one has stepped forward to build such a facility,
> and I think without that this can't even get off the ground.

Remember the bgworker patches extracted from Postgres-R?

[ Interestingly enough, one of the complaints I heard back then (not
necessarily from you) was that there's no user for bgworkers, yet.
Smells a lot like a chicken and egg problem to me. ]

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-23 Thread Markus Wanner
On 03/23/2011 12:52 PM, Robert Haas wrote:
> Yes.  What this won't do is let you build a big load-balancing network
> (at least not without great caution about what you assume).

This sounds too strong to me.  Session-aware load balancing is pretty
common these days.  It's the default mode of PgBouncer, for example.
Not much caution required there, IMO.  Or what pitfalls did you have in
mind?

> What it
> will do is make it really, really hard to lose committed transactions.
> Both good things, but different.

..you can still get both at the same time.  At least as long as you are
happy with session-aware load balancing.  And who really needs finer
grained balancing?

(Note that no matter how fine-grained you balance, you are still bound
to a (single core of a) single node.  That changes with distributed
querying, and things really start to get interesting there... but we are
far from that, yet).

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

