Re: [habari-dev] [PostgreSQL] - SCHEMA.SQL - CREATE SEQUENCE Improvements

Chris Meller Wed, 22 Jun 2011 15:12:52 -0700

I've skimmed over the chat logs you referenced, but I don't see the vastmajority of these points discussed... Primarily all I see is adiscussion about using ID alone as the primary key vs. other fields. SoI'm going to go through (below) and ask several questions to clarify.

------------------------------------------------------------------------


        cga <mailto:[email protected]>
June 22, 2011 3:13 PM


Hi all,

<PREMISE>
you might forgive me if I don't use Trac, but I'd like to explain and discuss
what I'm doing with PGSQL schema.sql for Habari before submitting (if my
contrib will be accepted code wise) anything.

Not to mention that in these very days things are getting moved on Github. If
so, I would clone and request for pull once there, since I have an active
account on Github and it seems pointless to get a Trac account now.

BUT: since this very patch isn't going (at least theoretically) to have any
bad effect on Habari , I'm attaching it to this very email for now. For you to
test. Forgive me on this last thing but I haven't set up a test env just yet.
I'll do it after sending this patch.

<little INTRO and WHY>
I'm cga. I'm a Linux SysAdmin and I'm all for "best practices" as much as I
can. Not to mention I'm really paranoid and analytical when it concerns
coding/installing/configuring/deploying. Especially for things I care/use/
administer.

I basically started this little contribution because I really love my stack:
debian + postgresql + nginx and I'd like to use it for all my needs. Habari
included. Ref: http://drunkenmonkey.org/irc/habari/2011-06-19#T17-04-24
I think it'sfair enough for a small intro.

<CONTRIB AND REASONING>
As you can read in the chat log I linked above , I'm not expert in neither
PostgreSQL nor SQL but I like to contribute for as much as I can. Especially
if this can be the only way to keep Habari's support for PostgreSQL alive.

This first little but important contribution it is all about the "CREATE
SEQUENCE" used in PGSQL actual schema.sql.

All the reasoning I do it's to be documented w/ links to official docs
and (attacched) chat logs from freenode.#postgresql , where (beside docs and
wiki) I learn and ask my stuff. And to my great pleasure I always find very
competent and very helpful people.

Please forgive my English errors, I'm Italian. Please *really* forgive
my verbosity but I'm sure you'd understand that people who didn't follow in
freenode.#habari don't know all the reasoning from the room. If I'm to be
accepted I promise I'll be as concise as possible in next communications.

Here we go:

I noticed that you use a lot of serialization in schema.sql , w/o questioning
why (which I'm not in a position to, anyway) I noticed that you (being whoever
did that schema.sql) do it wrong. Why?

Because: CREATE SEQUENCE done in the present form from schema.sql:

CREATE SEQUENCE {$prefix}posts_pkey_seq;
CREATE TABLE {$prefix}posts (
  id BIGINT NOT NULL DEFAULT nextval('{$prefix}posts_pkey_seq'),
.....
.....
  PRIMARY KEY (id),
.....
);


1 - It's *almost* correct, but no cigar. It basically miss the most crucial
part of the query to make the SEQUENCE itself have any sense *and* work
properly:

ALTER SEQUENCE  {$prefix}posts_pkey_seq OWNED BY{$prefix}posts.id;

The OWNED BY argument was only added in 8.2. The last time I tested thePG support (admittedly a long time ago) Debian only had 8.1 in itsrepos, so that was a major deal. I'm not sure what is commonly availablenow (though I suspect 8.3 is widely used), but breaking support for 8.1users is something to be considered, particularly since I see no realadvantage to adding the statement in our use case.

Per the Postgres documentation<http://www.postgresql.org/docs/8.3/static/sql-createsequence.html> onCREATE SEQUENCE, OWNED BY merely associates the sequence with the table(and field) specified so that if the table (or field) is dropped thesequence will be automatically dropped as well. This sounds like a goodidea at first, but I can see instances in which this might cause problems.

In the past there have been times that we've needed to modify a table,but one RDBMS or another won't support the right ALTER syntax, so we'vecreated a new field, migrated data, dropped the original field, and thenreversed the process. If we used OWNED BY on the sequence as you suggestthe sequence would be automatically dropped and likely cause a ton ofproblems. Not only would we have to remember to re-create the sequence(manually and specifically for Postgres), but we would have to guess atwhat the starting value should be, otherwise we would generate duplicateIDs - a Very Bad Thing (tm).

So the only advantage we'd get is that if the table were to be droppedthe sequence would be automatically deleted with it. That's nice for auser who is no longer using Habari, but hardly a major deal - it's anextra query per table to drop its relevant sequence or a few extraclicks in your GUI of choice. I can't see any instance in which we wouldbe dropping the Posts table (or most of the others) programmatically, soit really provides no extra value for developers.

w/o this ALTER TABLE the SEQUENCE is to behave weird more sooner than later,
resulting in duplicate numbers (see chat log; using BIGINT won't save you from
this, more on this later) and errors that are more than diffcult to find when
they happen.

I don't understand this. Can you explain how it's ever going to behavein an unexpected way? How would a sequence ever result in duplicatenumbers, unless it has been manually ALTERed with the RESTART argument?Even if we used OWNED BY, you could still screw the process up bymanually changing the sequence, so that wouldn't solve the problem either.

The use of the sequence, whether OWNED BY or not, should have nothing todo with whether the destination field is an INT or a BIGINT.

2 - it determines in a recursive way the name of the SEUQNECE itself , which
is wrong because you cannot be 100% sure that will actually work as intended.
(see chat log).

I'm not sure I understand your point here either. The sequence name isdynamically created during install based on the DB prefix you specify,yes. Again, how is that going to cause a problem and (aside from using aSERIAL, as mentioned below) what other alternative is there?

3 - it uses BIGINT. My reasoning is that whoever created this schema.sql
didn't wan't to finish the SEQUENCE too soon. Good idea , bad implementation. I
strongly believe that BIGINT is not needed indeed, and that it's just a waste
of cylces and space. Not to mention that BIGINT behaves differently on 32bit
and 64bit, causing slowness (and perhaps some more side effects) on 32bit.
(this is chat logs from the office, if you need them I'll attach them tomorrow).
And as explained just above, it is not going to be used as a BIGINT anyway
because of the wrong query CREATE SEQUENCE. But the main point why it is not
needed it's because who's to write a post every second for # quintillions
years? I think that a post every second for 68yrs is more than enough
(INTEGER/SERIAL). (see chat log too).

I cannot speak towards any performance differences between 32bit and64bit versions of Postgres, though I'm admittedly skeptical. We useBIGINT everywhere we can for ID columns (at least in MySQL and SQLite,looks like Postgres might have been missed) for one simple reason: wecan. The storage difference between INT and BIGINT varies betweenplatform but is so negligible we've never worried about it.

On the other hand there are instances in which it may be of value tohave the added capacity. A few off the top of my head:

1. Logs. Particularly if a plugin starts generating errors, you couldconceivably start generating a TON of IDs. Over the course of many yearsrunning Habari it's not entirely outside the realm of possibility thatyou could exhaust an INT.

2. Posts that aren't posts. While most of us use Habari only forblogging, there are several instances where there have been a lot ofother things built on top of it. Think product catalogs / online stores,forums, etc. It may seem crazy to create a new post every second for 68years, but it seems a lot less crazy if you stop thinking about eachentry as a time-consuming piece of written content. When you alsoinclude multiple types of content (like revisions, tweets, links, etc.)and consider that there might be a lot of turnover (or "churn") there'sall the more reason to go ahead and use a BIGINT.

If an INT column were to be maxed out it's a simple matter for a DBA togo in and change it (after figuring out what the problem actually was,of course), but recreating the indexes could be a very time consumingprocess as well. I believe this is actually one of the primary reasonsWordPress decided to switch to BIGINT fields many years ago.

So basically there are several reasons to use BIGINTs, even if they arelong-shots, and no real reason not to (aside from your suggestion thatthere's a performance difference, which we've never heard of before).Think of it as a rather hopeful best practice.

4 - when all the above are to be combined w/ the use of NOT NULL DEFAULT
nextval()" (and I mean all of them are compulsory clauses, see chat log), you
can save a lot of hassles, code and cycles with just the improvement I'm
submitting for every CREATE SEQUENCE query, leaving intact the final result for
what it concerns Habari's functioning *and* improves things a whole lotta
love:

CREATE TABLE {$prefix}posts (
  id SERIAL,
.....
.....
  PRIMARY KEY (id),
.....
);

Ignoring everything in the last 3 points related to sequences, I'mreally not sure why we never used SERIAL fields. From an implementationstandpoint they are identical to our approach (the Postgresdocumentation<http://www.postgresql.org/docs/8.1/static/datatype.html#DATATYPE-SERIAL> evenshows you the equivalent syntax using a regular sequence), but theywould have been more analogous to the AUTO INCREMENT fields we use inMySQL and SQLite.

On the flip side, since they're the same there's likewise no realcompelling reason to use SERIAL over a sequence. We had previously (andhopefully still do) wanted to also support Oracle, which is very similarto Postgres. Oracle lacks a SERIAL data type, so it would require thesequence method to be used anyway. Continuing to use sequences inPostgres would mean less variation between the two and make it a goodbit easier to manage them.


<IMPROVEMENTS ACHIEVED>
Improvements and optimizations achieved with this little patch: (see Ref. [2])

0 - it will leave intact the need of Habari to use the numerical field for
internal functioning.
1 - (id) will be unique and auto incremental by properties inherited from
query.
2 - You won't skip a number in a SEQUENCE.
3 - It won't fail because of wrong creation query.
4 - It won't determine an unexpected SEQUENCE name/relation to the column you
want it to.
5 - OWNED BY is to be used implicitly , this will drop anything related to the
elected talbe and/or column. This sounds harsh, but what will you do w/ data
that isn't associated w/ anything anymore?

<FINAL THOUGHTS>
1 - For the impact on Habari's code (which I didn't read because I don't do
php and couldn't afford to get it wrong) on insert the next value of the
sequence into the serial column please read Ref. [2] just below the PostgreSQL
7.3 Note.

2 - This new schema.sql would work (in theory) for new installations but I
don't know how to implement and test a system/schema/pgsql/upgrades/post/*.sql
yet. Help and advice from you is very much apprecciated.

3 - Let me know if anyone is going to test this before I do. If you want.
Anyway I'm going to test it when my test env will be up and running.

4 - I'm already planning the next batch of fixes but that can wait for now. I
anticipate that it will be about optimizing queries and about keyvil issues as
much as possible and w/o breaking Habari Ref. [3].

My pleasure to meet you.

<REFERENCES>
Ref. chat log attacched.

Ref [1]. http://www.postgresql.org/docs/8.4/interactive/sql-
createsequence.html

Ref [2]. http://www.postgresql.org/docs/8.4/interactive/datatype-
numeric.html#DATATYPE-SERIAL

Ref. [3] http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-ii-7345
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-iii-7365


--
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at http://groups.google.com/group/habari-dev

<<inline: compose-unknown-contact.jpg>>

Re: [habari-dev] [PostgreSQL] - SCHEMA.SQL - CREATE SEQUENCE Improvements

Reply via email to