Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

Heikki Linnakangas Wed, 15 Feb 2012 08:02:37 -0800

On 13.02.2012 19:13, Fujii Masao wrote:

On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com>  wrote:

On 13.02.2012 01:04, Jeff Janes wrote:


Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.



Thanks. That's basically the right approach. Attached patch contains a
cleaned up version of that.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle; it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.



Can you write a self-contained test case for that? I've been trying to
reproduce that by running the regression tests and pgbench with a streaming
replication standby, which should be pretty much the same as crash recovery.
No luck so far.


I think I was able to reproduce the same problem that Jeff saw. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t
values(generate_series(1,10000)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG:  database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number 0000 in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After recovery, I could not see the table "t" which I created before:

$ psql -c "select count(*) from t"
ERROR:  relation "t" does not exist
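
When comparing runs of this test case, it can help to classify how redo
ended ("record with zero length" versus "invalid magic number") and to turn
the segment/offset pair into an LSN that can be compared with the "redo done
at" position. The following is only a minimal sketch: summarize_recovery is a
hypothetical helper, the message formats are taken from the log lines quoted
in this thread, and the conversion assumes the default 16 MB WAL segment size.

```python
import re

# Assumption: the server uses the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

# Message formats as they appear in the server log quoted in this thread.
MAGIC_RE = re.compile(
    r"invalid magic number ([0-9A-Fa-f]+) in log file (\d+), "
    r"segment (\d+), offset (\d+)"
)
ZERO_LEN_RE = re.compile(r"record with zero length at ([0-9A-Fa-f]+/[0-9A-Fa-f]+)")
REDO_DONE_RE = re.compile(r"redo done at ([0-9A-Fa-f]+/[0-9A-Fa-f]+)")


def summarize_recovery(log_text):
    """Hypothetical helper: report how redo ended and at which position."""
    summary = {"end_reason": None, "end_lsn": None, "redo_done": None}
    for line in log_text.splitlines():
        m = MAGIC_RE.search(line)
        if m:
            # Groups 2..4 are decimal: log file id, segment number, byte offset.
            log_id, seg, off = (int(g) for g in m.groups()[1:])
            summary["end_reason"] = "invalid magic number"
            summary["end_lsn"] = f"{log_id:X}/{seg * WAL_SEGMENT_SIZE + off:X}"
            continue
        m = ZERO_LEN_RE.search(line)
        if m:
            summary["end_reason"] = "zero-length record"
            summary["end_lsn"] = m.group(1)
            continue
        m = REDO_DONE_RE.search(line)
        if m:
            summary["redo_done"] = m.group(1)
    return summary


SAMPLE_LOG = """\
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number 0000 in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
"""

if __name__ == "__main__":
    print(summarize_recovery(SAMPLE_LOG))
```

For the log above, the bad page header works out to 0/17B0000 under that
segment-size assumption, which lies shortly after the last replayed record at
0/17AD858.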

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4f38f5e5.8050...@enterprisedb.com)?
That includes Jeff's fix for the original crash you and Jeff saw. With
that version, I can't get a crash anymore. I also can't reproduce the
inconsistency that Jeff still saw with his fix
(http://archives.postgresql.org/message-id/CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+orfiyj...@mail.gmail.com).

Jeff, can you clarify if you're still seeing an issue with the latest
version of the patch? If so, can you give a self-contained test case for
that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
