Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Tom Lane
Sergey Koposov  writes:
> On Wed, 16 Jan 2013, Andres Freund wrote:
>> What about switching to -O1 of 11.1?

> I don't know. It is up to -hackers to decide. I think that icc on ia64
> has shown bugginess time after time. But if you think that a buildfarm
> with icc 11.1 -O1 carries more information than, say, running gcc, I can
> still run icc.

I think the reason that this bug doesn't manifest at -O1 is that then
icc doesn't attempt to do any loop unrolling/vectorizing.  So that's a
big chunk of potential optimization bugs we'd be dodging.  It's hard to
say whether that renders the test worthless in comparison with what
people would try to do in production.  Should we recommend that people
not try to use -O2 or higher with icc on IA64?

IMO it's important that we have some icc members in the buildfarm, just
because it's useful to see a different compiler's take on warnings.
We do have some icc-on-mainstream-Intel members, but not many.

Perhaps Sergey should use 10.1, which so far appears to not have so many
bugs.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Sergey Koposov

On Wed, 16 Jan 2013, Andres Freund wrote:
>> So unless somebody suggests otherwise, I'm going to switch to gcc on this
>> buildfarm.
>
> What about switching to -O1 of 11.1?

I don't know. It is up to -hackers to decide. I think that icc on ia64
has shown bugginess time after time. But if you think that a buildfarm
with icc 11.1 -O1 carries more information than, say, running gcc, I can
still run icc.


S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Andrew Dunstan


On 01/16/2013 09:41 AM, Sergey Koposov wrote:

> So unless somebody suggests otherwise, I'm going to switch to gcc on
> this buildfarm.


If you switch compiler it should be a new buildfarm animal. (That just 
means re-registering so you get a new name/secret pair.) We have 
provision for upgrading the OS version and the compiler version, but 
changing which OS or which compiler is used requires a new animal - it's 
too great a discontinuity.


cheers

andrew





Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Andres Freund
On 2013-01-16 14:41:47 +, Sergey Koposov wrote:
> Hi,
> 
> On Wed, 16 Jan 2013, Andres Freund wrote:
> >On 2013-01-16 01:28:09 -0500, Tom Lane wrote:
> >>It's a compiler bug.
> 
> Thanks for investigating. But I'm not sure there is any way for me
> other than switching to gcc, because Intel stopped providing their IA64
> version of compilers free of charge even for non-commercial/educational
> people:
> http://software.intel.com/en-us/intel-software-development-tools-for-intel-itanium-processors
> (their website is also a bit of a maze, so I don't see anywhere the updated
> versions of the 10.0, 10.1, 11.1 compilers that I have)
> 
> So unless somebody suggests otherwise, I'm going to switch to gcc on this
> buildfarm.

What about switching to -O1 of 11.1?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Sergey Koposov

Hi,

On Wed, 16 Jan 2013, Andres Freund wrote:
> On 2013-01-16 01:28:09 -0500, Tom Lane wrote:
>> It's a compiler bug.

Thanks for investigating. But I'm not sure there is any way for me
other than switching to gcc, because Intel stopped providing their
IA64 version of compilers free of charge even for
non-commercial/educational people:

http://software.intel.com/en-us/intel-software-development-tools-for-intel-itanium-processors
(their website is also a bit of a maze, so I don't see anywhere
the updated versions of the 10.0, 10.1, 11.1 compilers that I have)

So unless somebody suggests otherwise, I'm going to switch to gcc on this
buildfarm.


Cheers,
Sergey

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-16 Thread Andres Freund
Hi,


On 2013-01-16 01:28:09 -0500, Tom Lane wrote:
> It's a compiler bug.

Gah. Not again. Not that I am surprised, but still.

> icc 11.1 apparently thinks that this loop in doPickSplit:

> (Why does it think it needs to prefetch an array it's only going to
> write into?  Is IA64's cache hardware really that stupid?)

I think it is. They had that strange model of putting all the
intelligence into the compiler and making the hardware relatively
dumb. Worked well I'd say.

> And it makes use of IA64's bizarre scheme for software-unrolling
> loops, which I am going to do my darnedest to forget now that I've
> learned it;
> ...
> Diagnosis: icc 11.1 is not ready for prime time.

Consider me impressed. I tried to see what went wrong from a code
generation POV with the original bug but I have to admit I gave up with
only an inkling of an idea what it could be (I think it miscalculates the
starting offset when copying the whole 'result' memory at once).
That instruction set is just plain too crazy for my brain.

The consequence seems to be that icc on IA-64 in general is not ready for
prime time... I doubt they are still investing significant resources
into it anyway.

> I shall now retire with a glass of wine and attempt to forget everything
> I just learned about IA64.  What a bizarre architecture ...

The explanation/problem alone leaves me in want of something stronger,
only it's 1pm here...

Nice work,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
It's a compiler bug.

icc 11.1 apparently thinks that this loop in doPickSplit:

    /*
     * Update nodes[] array to point into the newly formed innerTuple, so that
     * we can adjust their downlinks below.
     */
    SGITITERATE(innerTuple, i, node)
    {
        nodes[i] = node;
    }

is going to be iterated hundreds of times, because it expends a
ridiculous amount of effort on it, starting with a loop prolog that
prefetches about a thousand bytes starting at the nodes pointer :-(
(Why does it think it needs to prefetch an array it's only going to
write into?  Is IA64's cache hardware really that stupid?)
And it makes use of IA64's bizarre scheme for software-unrolling
loops, which I am going to do my darnedest to forget now that I've
learned it; but for the purposes of this bit you only need to know
that the br.wtop.dptk instruction "renames" registers 32 and up,
so that whatever is in r32 when the bottom of the loop is reached
will be in r33 on the next iteration, and r33's contents move to r34,
etc.  In this particular example, this ridiculous complication saves
a grand total of no instructions, but nevermind that.
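
For readers who have not met rotating registers, the renaming described above can be modelled in a few lines of C. This is only a rough sketch under simplifying assumptions, not a description of the real hardware: the names RotFile, rot_reg and rot_rotate are invented for this example, and the rotating region is assumed to span r32 through r127.

```c
#include <assert.h>
#include <stdint.h>

#define ROT_BASE  32u   /* first rotating register (r32) */
#define ROT_COUNT 96u   /* assumed size of the rotating region (r32..r127) */

typedef struct {
    uint64_t phys[ROT_COUNT];  /* physical storage behind the logical names */
    unsigned rrb;              /* rename base, moved on every rotation */
} RotFile;

/* Map a logical register number (32..127) to its current physical slot. */
static uint64_t *rot_reg(RotFile *rf, unsigned r)
{
    assert(r >= ROT_BASE && r < ROT_BASE + ROT_COUNT);
    return &rf->phys[(r - ROT_BASE + rf->rrb) % ROT_COUNT];
}

/* Model of the renaming done by the loop branch: afterwards, the value that
 * was visible as rN is visible as rN+1.  No data moves; only the base does. */
static void rot_rotate(RotFile *rf)
{
    rf->rrb = (rf->rrb + ROT_COUNT - 1) % ROT_COUNT;
}
```

In this model, writing 7 through rot_reg(&rf, 32) and then calling rot_rotate(&rf) makes the same 7 readable through rot_reg(&rf, 33), which is the "r32 becomes r33 on the next iteration" behaviour described above.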

Before starting the loop, the code has computed

r28 = innerTuple
r29 = nodes
r26 = r29 + 1200 (this is where it will continue the prefetching...)
r33 = 0
r35 = innerTuple + innerTuple->prefixSize + 8 (ie, the initial value of "nt")
r27 = innerTuple + innerTuple->prefixSize + 8 + 6

And the body of the SGITITERATE loop looks like

.b4_110:
        at top of loop, r35 contains "nt" pointer, r33 contains "i"
 (p17)  st8 [r29]=r35,8             //0: {940:3} 4456 0
        store nt at *r29, increment r29 by 8 bytes (thus, assign to nodes[i])
 (p17)  add r32=1,r33               //0: {938:2} 4453 0
        compute i+1, will be next value of i due to register rename
 (p17)  ld2 r36=[r28]               //1: {938:2} 4462 0
        fetch first 2 bytes of innerTuple
 (p17)  ld2 r34=[r27],r33           //1: {938:2} 4459 0
        fetch last 2 bytes of node tuple, on first iteration anyway ...
        and then add the value of r33 to r27, which is all wrong
 (p17)  extr.u  r37=r36,3,13        //2: {938:2} 4463 0
        extract nNodes from fetched 2 bytes of innerTuple
 (p17)  extr.u  r33=r34,0,13 ;;     //2: {938:2} 4460 0
        extract size field of node tuple, or so it hopes
 (p17)  lfetch.nt1  [r26],8         //3: {938:2} 4454 0
        useless prefetch more than a thousand bytes away from the action
 (p17)  cmp4.lt p16,p0=r32,r37      //3: {938:2} 4464 0
        compare whether r32 (next value of i) < nNodes
 (p17)  add r34=r35,r33             //3: {938:2} 4461 0
        set r34 (next value of r35) to r35 + size field, or so it hopes
 (p16)  br.wtop.dptk .b4_110 ;;     //3: {938:2} 4465 0
        rename the registers and do it again, if the cmp4 returned true

The problem with this code is that r27, which ought to be always equal
to r35 + 6, is incremented by the wrong amount in the second ld2
instruction, namely by the "i" counter.  The value that *should* get
added to it is the node size field, ie the same value that's loaded into
r33 below that and then added to r35 in the last add instruction (and
then stored into r34, which is about to become r35).  So I think the
compiler has outsmarted itself as to which rotating register contains
which value when.
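
The difference between what the source asks for and what the generated code does can be sketched in plain C. This is a simplified model, not a faithful emulation: it ignores predication and the pipelining details, works on byte offsets into a buffer rather than real pointers, and the names load16, walk_correct and walk_buggy are made up for this illustration. The size field is assumed to occupy the low 13 bits of the 2 bytes at offset 6 of each node, as the r27 setup and the extr.u instruction above suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE_OFFSET 6        /* size field: last 2 bytes of an 8-byte header */
#define SIZE_MASK   0x1FFFu  /* low 13 bits, as in extr.u ...,0,13 */

/* Read a 2-byte little-endian value at byte offset 'off'. */
static unsigned load16(const uint8_t *buf, size_t off)
{
    return (unsigned) buf[off] | ((unsigned) buf[off + 1] << 8);
}

/* What the C source means: each node starts where the previous one ends. */
static void walk_correct(const uint8_t *buf, size_t first, int n_nodes,
                         size_t *nodes)
{
    size_t nt = first;

    assert(buf != NULL && nodes != NULL);
    for (int i = 0; i < n_nodes; i++) {
        nodes[i] = nt;
        nt += load16(buf, nt + SIZE_OFFSET) & SIZE_MASK;
    }
}

/* Model of the miscompiled loop: the pointer used to read the size field
 * (r27) is advanced by the loop counter i instead of by the node size. */
static void walk_buggy(const uint8_t *buf, size_t first, int n_nodes,
                       size_t *nodes)
{
    size_t nt = first;                      /* plays the role of r35 */
    size_t size_ptr = first + SIZE_OFFSET;  /* plays the role of r27 */

    assert(buf != NULL && nodes != NULL);
    for (int i = 0; i < n_nodes; i++) {
        unsigned size = load16(buf, size_ptr) & SIZE_MASK;

        nodes[i] = nt;
        size_ptr += (size_t) i;             /* wrong: should be += size */
        nt += size;
    }
}
```

Running both walks over the same buffer of variable-sized nodes, they agree on nodes[0] and then drift apart once the stale or misaligned size read starts to matter.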

The result of this breakage is that the set of node pointers computed by
the loop is totally wrong for all values after the first.  This means
the later loop that's trying to insert the now-known downlink TIDs into
the innerTuple's nodes is storing those TIDs into random locations, and
thus tromping all over memory.  The case where we get a reproducible
crash is where the Asserts in that loop notice that what's at the
pointed-to addresses isn't what's expected, before we manage to clobber
anything critical.

Diagnosis: icc 11.1 is not ready for prime time.

I shall now retire with a glass of wine and attempt to forget everything
I just learned about IA64.  What a bizarre architecture ...

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
Andres Freund  writes:
> -O0 passes

Grumble... suspect we're chasing another compiler bug now, but ...

You might try -O1; if that shows the bug it'll probably be a tad easier
to debug in.

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-16 02:34:52 +0100, Andres Freund wrote:
> On 2013-01-16 02:13:26 +0100, Andres Freund wrote:
> > On 2013-01-15 19:56:52 -0500, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > FWIW it's also triggerable if two other function calls are placed inside
> > > > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)).
> > > 
> > > [ confused... ]  You mean replacing the abort() in the elog macro with
> > > one of these functions?  Or something else?
> > 
> > I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any
> > two function calls inside a do/while(0). I just tried to place some
> > random functions there instead of the elog to make sure it's unrelated,
> > and it still triggers the problem even before the elog commit. The
> > assembler output of that function changes wildly with tiny changes and I
> > don't understand IA-64 at all (does anybody?), so I don't see anything
> > we can do there.
> > 
> > > > It seems the change just made an existing issue visible.
> > > > No idea what to do about it.
> > > 
> > > Pretty clearly a compiler bug at this point.  Since there doesn't seem
> > > to be a clean workaround (no, I don't want to expand the struct
> > > assignment manually), and anyway we can't be sure that the bug doesn't
> > > also manifest in other places, recommending Sergey update his compiler
> > > seems like the thing to do.
> > 
> > Yea. Don't have a better suggestion.
> > 
> > > At this point I'm more interested in his report in
> > >  about
> > > the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
> > > more likely to have a genuine bug, and I wonder if it's related to
> > > the spgist issue in <50ebf992.2000...@qunar.com> ...
> > 
> > Yes, it looks more like it could be something real. There are
> > suspiciously many other failing tests though (misc, with) that don't seem
> > to be related to the spgist crash.
> 
> #4  0x401a6320 in doPickSplit (index=0x6007ff48, state=0x3, 
> current=0x6ff7a700, parent=0x4, newLeafTuple=0x6, 
> level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222
> (gdb) p parent
> $4 = {blkno = 1, buffer = 356, page = 0x2148eea0 "", offnum = 1, node 
> = 4}
> 
> (gdb) p &parent
> $7 = (SPPageDesc *) 0x6ff7a900

-O0 passes

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-15 20:32:00 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2013-01-15 19:56:52 -0500, Tom Lane wrote:
> >> At this point I'm more interested in his report in
> >>  about
> >> the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
> >> more likely to have a genuine bug, and I wonder if it's related to
> >> the spgist issue in <50ebf992.2000...@qunar.com> ...
> 
> > Yes, it looks more like it could be something real. There are
> > suspiciously many other failing tests though (misc, with) that don't seem
> > to be related to the spgist crash.
> 
> Looking again, the pg_regress output appears to indicate two separate
> crashes (one during rangetypes, the other during create_index).  The
> reported Assert trap was in the rangetypes test, but the other one
> could very easily have been from spgist code as well.  I'd tend to
> write off all the other reported diffs as followon damage from the
> crashes, at least without clearer evidence that they weren't.  There are
> very many instances in our regression tests where failure to complete
> one test results in bogus diffs in later ones, because DB objects don't
> exist or don't have the expected contents.

I just checked and it's just followup damage.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-16 02:13:26 +0100, Andres Freund wrote:
> On 2013-01-15 19:56:52 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > FWIW it's also triggerable if two other function calls are placed inside
> > > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)).
> > 
> > [ confused... ]  You mean replacing the abort() in the elog macro with
> > one of these functions?  Or something else?
> 
> I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any
> two function calls inside a do/while(0). I just tried to place some
> > random functions there instead of the elog to make sure it's unrelated,
> and it still triggers the problem even before the elog commit. The
> assembler output of that function changes wildly with tiny changes and I
> don't understand IA-64 at all (does anybody?), so I don't see anything
> we can do there.
> 
> > > It seems the change just made an existing issue visible.
> > > No idea what to do about it.
> > 
> > Pretty clearly a compiler bug at this point.  Since there doesn't seem
> > to be a clean workaround (no, I don't want to expand the struct
> > assignment manually), and anyway we can't be sure that the bug doesn't
> > also manifest in other places, recommending Sergey update his compiler
> > seems like the thing to do.
> 
> Yea. Don't have a better suggestion.
> 
> > At this point I'm more interested in his report in
> >  about
> > the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
> > more likely to have a genuine bug, and I wonder if it's related to
> > the spgist issue in <50ebf992.2000...@qunar.com> ...
> 
> Yes, it looks more like it could be something real. There are
> suspiciously many other failing tests though (misc, with) that don't seem
> to be related to the spgist crash.

#3  0x40b5c710 in ExceptionalCondition (
conditionName=0x40c76d50 "!(( ((void) ((bool) ((! assert_enabled) 
|| ! (!(((bool) (((const void*)(&nodes[n]->t_tid) != ((void *)0)) && 
((&nodes[n]->t_tid)->ip_posid != 0) || (ExceptionalCondition(\"!(((bool) 
(((const void*)"..., 
errorType=0x40c4c5a0 "FailedAssertion", fileName=0x40c75d30 
"spgdoinsert.c", lineNumber=1222) at assert.c:54
#4  0x401a6320 in doPickSplit (index=0x6007ff48, state=0x3, 
current=0x6ff7a700, parent=0x4, newLeafTuple=0x6, 
level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222
#5  0x401a12d0 in spgdoinsert (index=0x29856028, 
state=0x6ff7a9d0, heapPtr=0x601e6e7c, 
datum=6917546619826579712, isnull=0 '\0') at spgdoinsert.c:1996
#6  0x40195870 in spginsert (fcinfo=0x6ff7a9d0) at 
spginsert.c:222
#7  0x40b77dd0 in FunctionCall6Coll (flinfo=0x60102018, 
collation=0, arg1=2305843009373429800, arg2=6917546619826580944, 
arg3=6917546619826581200, arg4=6917529027643076220, 
arg5=2305843009373166576, arg6=0) at fmgr.c:1439
#8  0x40148b70 in index_insert (indexRelation=0x29856028, 
values=0x6ff7add0, isnull=0x6ff7aed0 "", 
heap_t_ctid=0x601e6e7c, heapRelation=0x29815bf0, 
checkUnique=UNIQUE_CHECK_NO) at indexam.c:216
#9  0x404e99f0 in ExecInsertIndexTuples (slot=0x601e55c0, 
tupleid=0x601e6e7c, estate=0x601e4f18)
at execUtils.c:1088
#10 0x40516710 in ExecModifyTable (node=0x0) at nodeModifyTable.c:249
#11 0x404c6350 in $$1$3_0$TAG$0ca$0$3 () at execProcnode.c:377
#12 0x404bba00 in ExecutorRun (queryDesc=0x601e4fb0, 
direction=NoMovementScanDirection, count=0) at execMain.c:1400
#13 0x408493f0 in PortalRunMulti (portal=0x600ff7f8, 
isTopLevel=-26 '\346', dest=0x601ef658, 
altdest=0x601ef658, completionTag=0x6ff7b2d0 "") at 
pquery.c:185
#14 0x40848d20 in _setjmp_lpad_PortalRun_1$0$13 () at pquery.c:814
#15 0x40840c60 in exec_simple_query (
query_string=0x6018d4f8 "insert into test_range_spgist select 
'empty'::int4range from generate_series(1,500) g;")
at postgres.c:1048
#16 0x408370a0 in _setjmp_lpad_PostgresMain_0$0$51 () at postgres.c:3969
#17 0x40720240 in BackendStartup (port=0x600fc950) at 
postmaster.c:3989
#18 0x4071dc80 in ServerLoop () at postmaster.c:1575
#19 0x4071a700 in PostmasterMain (argc=9, argv=0x600dc300) at 
postmaster.c:1244
#20 0x405796d0 in main (argc=9, argv=0x600dc010) at main.c:197


#4  0x401a6320 in doPickSplit (index=0x6007ff48, state=0x3, 
current=0x6ff7a700, parent=0x4, newLeafTuple=0x6, 
level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222
1222
Assert(ItemPointerGetBlockNumber(&nodes[n]->t_tid) == leafBlock);

(gdb) info locals
in = {nTuples = 227, datums = 0x60205060, level = 1}
out = {hasPrefix = 0 '\0', prefixDatum = 0, nNodes = 8, nodeLabels = 0

Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
Andres Freund  writes:
> On 2013-01-15 19:56:52 -0500, Tom Lane wrote:
>> At this point I'm more interested in his report in
>>  about
>> the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
>> more likely to have a genuine bug, and I wonder if it's related to
>> the spgist issue in <50ebf992.2000...@qunar.com> ...

> Yes, it looks more like it could be something real. There are
> suspiciously many other failing tests though (misc, with) that don't seem
> to be related to the spgist crash.

Looking again, the pg_regress output appears to indicate two separate
crashes (one during rangetypes, the other during create_index).  The
reported Assert trap was in the rangetypes test, but the other one
could very easily have been from spgist code as well.  I'd tend to
write off all the other reported diffs as followon damage from the
crashes, at least without clearer evidence that they weren't.  There are
very many instances in our regression tests where failure to complete
one test results in bogus diffs in later ones, because DB objects don't
exist or don't have the expected contents.

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-15 19:56:52 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > FWIW it's also triggerable if two other function calls are placed inside
> > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)).
> 
> [ confused... ]  You mean replacing the abort() in the elog macro with
> one of these functions?  Or something else?

I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any
two function calls inside a do/while(0). I just tried to place some
random functions there instead of the elog to make sure it's unrelated,
and it still triggers the problem even before the elog commit. The
assembler output of that function changes wildly with tiny changes and I
don't understand IA-64 at all (does anybody?), so I don't see anything
we can do there.
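
The shape of that experiment can be pictured roughly as follows. This is a sketch, not the actual ForwardFsyncRequest code: the macro name TWO_CALLS, the function forward_request_sketch and its parameter are invented for illustration, and only the two calls, fprintf(stderr, "argh") and kill(0, 0), come from the message. It assumes a POSIX system for kill().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* make kill() visible in strict modes */
#endif

#include <signal.h>
#include <stdio.h>

/* Hypothetical stand-in for the elog(ERROR, ...) call: two ordinary function
 * calls wrapped in the usual do { ... } while (0) idiom, so that the macro
 * behaves as a single statement after an if ().  kill(0, 0) sends signal 0,
 * which only performs the permission/existence check. */
#define TWO_CALLS() \
    do { \
        fprintf(stderr, "argh"); \
        kill(0, 0); \
    } while (0)

/* Hypothetical, heavily simplified function with the same control shape:
 * an early if () holding the macro, followed by the real work. */
static int forward_request_sketch(int queue_available)
{
    if (!queue_available)
        TWO_CALLS();

    /* ... the struct assignment under suspicion would follow here ... */
    return queue_available ? 1 : 0;
}
```

The point of the do/while(0) wrapper is only that the two calls expand to one statement; nothing in it should influence code generated later in the function.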

> > It seems the change just made an existing issue visible.
> > No idea what to do about it.
> 
> Pretty clearly a compiler bug at this point.  Since there doesn't seem
> to be a clean workaround (no, I don't want to expand the struct
> assignment manually), and anyway we can't be sure that the bug doesn't
> also manifest in other places, recommending Sergey update his compiler
> seems like the thing to do.

Yea. Don't have a better suggestion.

> At this point I'm more interested in his report in
>  about
> the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
> more likely to have a genuine bug, and I wonder if it's related to
> the spgist issue in <50ebf992.2000...@qunar.com> ...

Yes, it looks more like it could be something real. There are
suspiciously many other failing tests though (misc, with) that don't seem
to be related to the spgist crash.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
Andres Freund  writes:
> FWIW it's also triggerable if two other function calls are placed inside
> the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)).

[ confused... ]  You mean replacing the abort() in the elog macro with
one of these functions?  Or something else?

> It seems the change just made an existing issue visible.
> No idea what to do about it.

Pretty clearly a compiler bug at this point.  Since there doesn't seem
to be a clean workaround (no, I don't want to expand the struct
assignment manually), and anyway we can't be sure that the bug doesn't
also manifest in other places, recommending Sergey update his compiler
seems like the thing to do.

At this point I'm more interested in his report in
 about
the Assert at spgdoinsert.c:1222 failing.  That's pretty new code, so
more likely to have a genuine bug, and I wonder if it's related to
the spgist issue in <50ebf992.2000...@qunar.com> ...

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-16 00:26:01 +0100, Andres Freund wrote:
> On 2013-01-15 17:56:40 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > I played a bit around (thanks Sergey!) and it seems to be some rather
> > > strange optimization issue around the fsync request queue.
> > 
> > > Namely changing 
> > >   request->rnode = rnode;
> > > into
> > >   request->rnode.spcNode = rnode.spcNode;
> > >   request->rnode.dbNode = rnode.dbNode;
> > >   request->rnode.relNode = rnode.relNode;
> > > makes it pass reliably.
> > 
> > Jeez.  That's my candidate for weird compiler bug of the month.
> > 
> > > How the hell that's correlating with the elog changes I don't yet know.
> > 
> > There is an elog(ERROR) further up in the same function, but it's sure
> > not clear how that could cause the compiler to misimplement a struct
> > assignment.
> 
> Indeed, replacing the elog() there with a plain abort() or the old-style
> elog definition makes it work. Just using a do-while with the old
> definition inside makes it fail.
> 
> My IA64 knowledge is pretty basic, but I would guess this is stack or
> code alignment related; I seem to remember quite some strange
> requirements there.

FWIW it's also triggerable if two other function calls are placed inside
the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)).
It seems the change just made an existing issue visible.

No idea what to do about it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-15 17:56:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I played a bit around (thanks Sergey!) and it seems to be some rather
> > strange optimization issue around the fsync request queue.
> 
> > Namely changing 
> > request->rnode = rnode;
> > into
> > request->rnode.spcNode = rnode.spcNode;
> > request->rnode.dbNode = rnode.dbNode;
> > request->rnode.relNode = rnode.relNode;
> > makes it pass reliably.
> 
> Jeez.  That's my candidate for weird compiler bug of the month.
> 
> > How the hell that's correlating with the elog changes I don't yet know.
> 
> There is an elog(ERROR) further up in the same function, but it's sure
> not clear how that could cause the compiler to misimplement a struct
> assignment.

Indeed, replacing the elog() there with a plain abort() or the old-style
elog definition makes it work. Just using a do-while with the old
definition inside makes it fail.

My IA64 knowledge is pretty basic, but I would guess this is stack or
code alignment related; I seem to remember quite some strange
requirements there.

>  Maybe the problem is not in those lines alone, but the fact
> that rnode is a pass-by-value struct?  (That is, maybe it's the value of
> the rnode local variable that's getting munged, somewhere up near the
> elog call?)

No, I found this because I printed the values before enqueuing them
into shmem and after dequeuing. After noticing that they didn't match I
added more...

> We tend to not use pass-by-value struct params much, so we
> might not have noticed a compiler bug associated with that.  Or IOW,
> does changing ForwardFsyncRequest to use a "const RelFileNode *rnode"
> parameter make it go away?

Nope, same thing.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
Andres Freund  writes:
> I played a bit around (thanks Sergey!) and it seems to be some rather
> strange optimization issue around the fsync request queue.

> Namely changing 
>   request->rnode = rnode;
> into
>   request->rnode.spcNode = rnode.spcNode;
>   request->rnode.dbNode = rnode.dbNode;
>   request->rnode.relNode = rnode.relNode;
> makes it pass reliably.

Jeez.  That's my candidate for weird compiler bug of the month.

> How the hell that's correlating with the elog changes I don't yet know.

There is an elog(ERROR) further up in the same function, but it's sure
not clear how that could cause the compiler to misimplement a struct
assignment.  Maybe the problem is not in those lines alone, but the fact
that rnode is a pass-by-value struct?  (That is, maybe it's the value of
the rnode local variable that's getting munged, somewhere up near the
elog call?)  We tend to not use pass-by-value struct params much, so we
might not have noticed a compiler bug associated with that.  Or IOW,
does changing ForwardFsyncRequest to use a "const RelFileNode *rnode"
parameter make it go away?
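
For concreteness, the experiment being asked about would look roughly like this. The struct is a stand-in carrying the three field names seen in the thread (plain unsigned int is an assumption made to keep the sketch self-contained), and the type and function names are hypothetical simplifications, not the real ForwardFsyncRequest signature.

```c
#include <stdbool.h>

/* Stand-in for RelFileNode; the field types are an assumption. */
typedef struct {
    unsigned int spcNode;
    unsigned int dbNode;
    unsigned int relNode;
} RelFileNodeSketch;

/* Stand-in for one slot of the request queue. */
typedef struct {
    RelFileNodeSketch rnode;
    int               forknum;
    unsigned int      segno;
} RequestSketch;

/* Existing style: the struct is passed by value. */
static bool forward_by_value(RequestSketch *slot, RelFileNodeSketch rnode,
                             int forknum, unsigned int segno)
{
    slot->rnode = rnode;
    slot->forknum = forknum;
    slot->segno = segno;
    return true;
}

/* The experiment: pass a pointer to const and copy from it instead. */
static bool forward_by_pointer(RequestSketch *slot,
                               const RelFileNodeSketch *rnode,
                               int forknum, unsigned int segno)
{
    slot->rnode = *rnode;
    slot->forknum = forknum;
    slot->segno = segno;
    return true;
}
```

Both variants should fill the slot identically; if only the by-value form produced garbage, that would point at parameter passing rather than at the struct assignment itself.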

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-15 14:40:11 -0500, Tom Lane wrote:
> Sergey Koposov  writes:
> > And I do see the tblspc file left after the finish of "make check":
> > tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387
> 
> Interesting.  If the tests are run immediately after initdb, 16387
> is the relfilenode assigned to table "foo" in the tablespace regression
> test.  But why would only that table be left behind?  There are half
> a dozen relations in that tablespace at the point of the DROP CASCADE.
> 
> BTW, I just finished trying to reproduce this on an IA64 machine
> belonging to Red Hat, without success.  So that seems to eliminate
> any possibility of the machine architecture being the trigger issue.
> The compiler's still a likely cause though.
> 
> Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
> to try?  Also, Sergey, can you find a non-dot-zero release to try?

I played a bit around (thanks Sergey!) and it seems to be some rather
strange optimization issue around the fsync request queue.

Namely changing

    /* OK, insert request */
    request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
    request->rnode = rnode;
    request->forknum = forknum;
    request->segno = segno;

into

    /* OK, insert request */
    request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
    request->rnode.spcNode = rnode.spcNode;
    request->rnode.dbNode = rnode.dbNode;
    request->rnode.relNode = rnode.relNode;
    request->forknum = forknum;
    request->segno = segno;

makes it pass reliably.

Displaying the values of request after the assignment, but without the
change, shows bogus values, which explains the problems.
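
A stripped-down version of that kind of check might look like the sketch below. It is not the code actually used: the struct is a stand-in with the three field names from the snippet above (unsigned int is an assumed type), and assign_whole, assign_members and check_copy are names invented here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for RelFileNode; the field types are an assumption. */
typedef struct {
    unsigned int spcNode;
    unsigned int dbNode;
    unsigned int relNode;
} RelFileNodeSketch;

/* Whole-struct assignment, the form that was reported to misbehave. */
static void assign_whole(RelFileNodeSketch *dst, RelFileNodeSketch src)
{
    *dst = src;
}

/* Member-by-member assignment, the form that passed reliably. */
static void assign_members(RelFileNodeSketch *dst, RelFileNodeSketch src)
{
    dst->spcNode = src.spcNode;
    dst->dbNode = src.dbNode;
    dst->relNode = src.relNode;
}

/* Compare the stored copy with the source and print both on a mismatch,
 * similar in spirit to displaying the values after the assignment. */
static bool check_copy(const char *label, const RelFileNodeSketch *stored,
                       const RelFileNodeSketch *expected)
{
    bool ok = stored->spcNode == expected->spcNode &&
              stored->dbNode == expected->dbNode &&
              stored->relNode == expected->relNode;

    if (!ok)
        fprintf(stderr, "%s: got %u/%u/%u, expected %u/%u/%u\n", label,
                stored->spcNode, stored->dbNode, stored->relNode,
                expected->spcNode, expected->dbNode, expected->relNode);
    return ok;
}
```

With a correct compiler the two assignment forms are equivalent, so check_copy should return true for both; a mismatch reported only for assign_whole is the symptom described here.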

How the hell that's correlating with the elog changes I don't yet know.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Sergey Koposov

On Tue, 15 Jan 2013, Tom Lane wrote:


> BTW, I just finished trying to reproduce this on an IA64 machine
> belonging to Red Hat, without success.  So that seems to eliminate
> any possibility of the machine architecture being the trigger issue.
> The compiler's still a likely cause though.
>
> Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
> to try?  Also, Sergey, can you find a non-dot-zero release to try?


I think the compiler is indeed the main issue.
I've tried 10.1 (10.1.011) and it doesn't fail.

When I tried 11.1 (icc (ICC) 11.1 20100401) it failed in quite a strange
way (I don't remember this happening before):


test tablespace               ... ok
parallel group (18 tests):  txid int2 text name oid varchar int4 char money float8 uuid float4 int8 boolean bit enum numeric rangetypes
     boolean                  ... ok
     char                     ... ok
     name                     ... ok
     varchar                  ... ok
     text                     ... ok
     int2                     ... ok
     int4                     ... ok
     int8                     ... ok
     oid                      ... ok
     float4                   ... ok
     float8                   ... ok
     bit                      ... ok
     numeric                  ... ok
     txid                     ... ok
     uuid                     ... ok
     enum                     ... ok
     money                    ... ok
     rangetypes               ... FAILED (test process exited with exit code 2)
test strings                  ... FAILED (test process exited with exit code 2)
test numerology               ... FAILED (test process exited with exit code 2)
parallel group (19 tests):  path interval time inet circle macaddr comments timestamp timestamptz reltime date tstypes tinterval abstime timetz lseg box polygon point
     point                    ... FAILED (test process exited with exit code 2)
     lseg                     ... FAILED (test process exited with exit code 2)
     box                      ... FAILED (test process exited with exit code 2)
     path                     ... FAILED (test process exited with exit code 2)
     polygon                  ... FAILED (test process exited with exit code 2)
     circle                   ... FAILED (test process exited with exit code 2)
     date                     ... FAILED (test process exited with exit code 2)
     time                     ... FAILED (test process exited with exit code 2)
     timetz                   ... FAILED (test process exited with exit code 2)
     timestamp                ... FAILED (test process exited with exit code 2)
     timestamptz              ... FAILED (test process exited with exit code 2)
     interval                 ... FAILED (test process exited with exit code 2)
     abstime                  ... FAILED (test process exited with exit code 2)
     reltime                  ... FAILED (test process exited with exit code 2)
     tinterval                ... FAILED (test process exited with exit code 2)
     inet                     ... FAILED (test process exited with exit code 2)
     macaddr                  ... FAILED (test process exited with exit code 2)
     tstypes                  ... FAILED (test process exited with exit code 2)
     comments                 ... FAILED (test process exited with exit code 2)
parallel group (6 tests):  geometry regex horology type_sanity oidjoins opr_sanity
     geometry                 ... FAILED
     horology                 ... FAILED
     regex                    ... ok
     oidjoins                 ... ok
     type_sanity              ... ok
     opr_sanity               ... ok
test insert                   ... ok
test create_function_1        ... ok
test create_type              ... ok
test create_table             ... ok
test create_function_2        ... ok
parallel group (2 tests):  copyselect copy
     copy                     ... ok
     copyselect               ... ok
parallel group (2 tests):  create_operator create_misc
     create_misc              ... ok
     create_operator          ... ok
parallel group (2 tests):  create_view create_index
     create_index             ... FAILED (test process exited with exit code 2)
     create_view              ... ok
parallel group (11 tests):  constraints triggers create_cast create_function_3 updatable_views inherit drop_if_exists create_aggregate create_table_like typed_table vacuum
     create_aggregate         ... FAILED (test process exited with exit code 2)
     create_function_3        ... FAILED (test process exited with exit code 2)
     create_cast              ... FAILED (test process exited with exit code 2)
     constraints              ... FAILED (test process exited with exit code 2)
     triggers                 ... FAILED (test process exited with exit code 2)
     inherit                  ... FAILED (test process exited with exit code 2)
     create_table_like        ... FAILED (test process exite

Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Tom Lane
Sergey Koposov  writes:
> And I do see the tblspc file left after the finish of "make check":
>   tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Interesting.  If the tests are run immediately after initdb, 16387
is the relfilenode assigned to table "foo" in the tablespace regression
test.  But why would only that table be left behind?  There are half
a dozen relations in that tablespace at the point of the DROP CASCADE.

BTW, I just finished trying to reproduce this on an IA64 machine
belonging to Red Hat, without success.  So that seems to eliminate
any possibility of the machine architecture being the trigger issue.
The compiler's still a likely cause though.

Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
to try?  Also, Sergey, can you find a non-dot-zero release to try?

regards, tom lane




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Sergey Koposov

On Tue, 15 Jan 2013, Andres Freund wrote:

> Any chance you could run make check again but with log_statement=all and
> log_min_messages=debug2? That might tell us a bit more, whether the
> dependency code doesn't work right or whether the checkpoint is doing
> strange things.


Here it is :


2013-01-15 23:06:18 MSK [50f5a8aa.1162:1] DEBUG:  SlruScanDirectory invoking 
callback on pg_notify/
2013-01-15 23:06:18 MSK [50f5a8aa.1162:2] DEBUG:  removing file "pg_notify/"
2013-01-15 23:06:18 MSK [50f5a8aa.1162:3] DEBUG:  max_safe_fds = 985, 
usable_fds = 1000, already_open = 5
2013-01-15 23:06:18 MSK [50f5a8aa.1167:1] LOG:  database system was shut down 
at 2013-01-15 23:06:18 MSK
2013-01-15 23:06:18 MSK [50f5a8aa.1167:2] DEBUG:  checkpoint record is at 
0/17700E0
2013-01-15 23:06:18 MSK [50f5a8aa.1167:3] DEBUG:  redo record is at 0/17700E0; 
shutdown TRUE
2013-01-15 23:06:18 MSK [50f5a8aa.1167:4] DEBUG:  next transaction ID: 0/686; 
next OID: 12031
2013-01-15 23:06:18 MSK [50f5a8aa.1167:5] DEBUG:  next MultiXactId: 1; next 
MultiXactOffset: 0
2013-01-15 23:06:18 MSK [50f5a8aa.1167:6] DEBUG:  oldest unfrozen transaction 
ID: 676, in database 1
2013-01-15 23:06:18 MSK [50f5a8aa.1167:7] DEBUG:  transaction ID wrap limit is 
2147484323, limited by database with OID 1
2013-01-15 23:06:18 MSK [50f5a8aa.1168:1] DEBUG:  checkpointer updated shared 
memory configuration values
2013-01-15 23:06:18 MSK [50f5a8aa.116b:1] LOG:  autovacuum launcher started
2013-01-15 23:06:18 MSK [50f5a8aa.1162:4] LOG:  database system is ready to 
accept connections
2013-01-15 23:06:19 MSK [50f5a8aa.1162:5] DEBUG:  forked new backend, pid=4463 
socket=8
2013-01-15 23:06:19 MSK [50f5a8aa.1162:6] DEBUG:  server process (PID 4463) 
exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:7] DEBUG:  forked new backend, pid=4465 
socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1171:1] LOG:  statement: CREATE DATABASE 
"regression" TEMPLATE=template0
2013-01-15 23:06:19 MSK [50f5a8aa.1168:2] LOG:  checkpoint starting: immediate 
force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:3] DEBUG:  SlruScanDirectory invoking 
callback on pg_multixact/offsets/
2013-01-15 23:06:19 MSK [50f5a8aa.1168:4] DEBUG:  SlruScanDirectory invoking 
callback on pg_multixact/members/
2013-01-15 23:06:19 MSK [50f5a8aa.1168:5] DEBUG:  attempting to remove WAL 
segments older than log file 
2013-01-15 23:06:19 MSK [50f5a8aa.1168:6] DEBUG:  SlruScanDirectory invoking 
callback on pg_subtrans/
2013-01-15 23:06:19 MSK [50f5a8aa.1168:7] LOG:  checkpoint complete: wrote 3 
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; 
write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, 
average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8aa.1168:8] LOG:  checkpoint starting: immediate 
force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:9] DEBUG:  attempting to remove WAL 
segments older than log file 
2013-01-15 23:06:19 MSK [50f5a8aa.1168:10] DEBUG:  SlruScanDirectory invoking 
callback on pg_subtrans/
2013-01-15 23:06:19 MSK [50f5a8aa.1168:11] LOG:  checkpoint complete: wrote 0 
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; 
write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, 
average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8aa.1162:8] DEBUG:  server process (PID 4465) 
exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:9] DEBUG:  forked new backend, pid=4467 
socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1173:1] LOG:  statement: ALTER DATABASE "regression" SET lc_messages TO 'C';ALTER 
DATABASE "regression" SET lc_monetary TO 'C';ALTER DATABASE "regression" SET lc_numeric TO 'C';ALTER DATABASE 
"regression" SET lc_time TO 'C';ALTER DATABASE "regression" SET timezone_abbreviations TO 'Default';
2013-01-15 23:06:19 MSK [50f5a8aa.1162:10] DEBUG:  server process (PID 4467) 
exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:11] DEBUG:  forked new backend, pid=4469 
socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1175:1] LOG:  statement: CREATE TABLESPACE 
testspace LOCATION '/home/math/pg_git/src/test/regress/testtablespace';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:2] LOG:  statement: ALTER TABLESPACE 
testspace SET (random_page_cost = 1.0);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:3] LOG:  statement: ALTER TABLESPACE 
testspace SET (some_nonexistent_parameter = true);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:4] ERROR:  unrecognized parameter 
"some_nonexistent_parameter"
2013-01-15 23:06:19 MSK [50f5a8ab.1175:5] STATEMENT:  ALTER TABLESPACE 
testspace SET (some_nonexistent_parameter = true);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:6] LOG:  statement: ALTER TABLESPACE 
testspace RESET (random_page_cost = 2.0);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:7] ERROR:  RESET must not include values 
for parameters
2013-01-15 23:06:19 MSK [50f5a8ab.1175:8] STATEMENT:  ALTER TABLESPACE 
testspace RESET (random_page_cost = 2.0);
2013-01-15 23:

Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Andres Freund
On 2013-01-15 17:27:50 +, Sergey Koposov wrote:
> Hi,
> 
> >Date: Tue, 15 Jan 2013 11:57:07 -0500
> >From: Tom Lane 
> >To: Andres Freund 
> >Cc: m...@sai.msu.ru, pgsql-hackers@postgreSQL.org,
> >   Andrew Dunstan 
> >Subject: Re: Curious buildfarm failures
> >
> >Well, it could be quite reproducible, if for example what's happening is
> >that the DROP is failing to wait for the checkpointer at all.
> >
> >Is there a way to enable log_checkpoints during a buildfarm run?
> >It'd be good to get timestamps added to the postmaster log entries, too.
> 
> Here is the log output from the failing pg_regress after enabling checkpoints 
> and timestamps:
> 
> 2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG:  database system was shut down at 2013-01-15 21:20:19 MSK
> 2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG:  autovacuum launcher started
> 2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG:  database system is ready to accept connections
> 2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG:  checkpoint starting: immediate force wait
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0, longest=0.000 s, average=0.000 s
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG:  checkpoint starting: immediate force wait
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR:  unrecognized parameter "some_nonexistent_parameter"
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT:  ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true);
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR:  RESET must not include values for parameters
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT:  ALTER TABLESPACE testspace RESET (random_page_cost = 2.0);
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR:  duplicate key value violates unique constraint "anindex"
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL:  Key (column1)=(1) already exists.
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT:  INSERT INTO testschema.atable VALUES(1);
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR:  directory "/no/such/location" does not exist
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT:  CREATE TABLESPACE badspace LOCATION '/no/such/location';
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR:  tablespace "nosuchspace" does not exist
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT:  CREATE TABLE bar (i int) TABLESPACE nosuchspace;
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG:  checkpoint starting: immediate force wait
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG:  checkpoint complete: wrote 37 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR:  tablespace "testspace" is not empty
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT:  DROP TABLESPACE testspace;
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG:  checkpoint starting: immediate force wait
> 2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG:  checkpoint complete: wrote 9 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR:  tablespace "testspace" is not empty
> 2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT:  DROP TABLESPACE testspace;
> 
> 
> And I do see the tblspc file left after the finish of "make check":
>   tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387
> 
> Cheers,
>   S
> 
> PS I wouldn't be surprised that it is a compiler bug though. But I did see
> the failure with newer icc as well.

Any chance you could run make check again but with log_statement=all and
log_min_messages=debug2? That might tell us a bit more, whether the
dependency code doesn't work right or whether the checkpoint is doing
strange things.

Thanks,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures (fwd)

2013-01-15 Thread Sergey Koposov

Hi,


> Date: Tue, 15 Jan 2013 11:57:07 -0500
> From: Tom Lane
> To: Andres Freund
> Cc: m...@sai.msu.ru, pgsql-hackers@postgreSQL.org,
>    Andrew Dunstan
> Subject: Re: Curious buildfarm failures
>
> Well, it could be quite reproducible, if for example what's happening is
> that the DROP is failing to wait for the checkpointer at all.
>
> Is there a way to enable log_checkpoints during a buildfarm run?
> It'd be good to get timestamps added to the postmaster log entries, too.


Here is the log output from the failing pg_regress run after enabling
log_checkpoints and timestamps:

2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG:  database system was shut down at 2013-01-15 21:20:19 MSK
2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG:  autovacuum launcher started
2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG:  database system is ready to accept connections
2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG:  checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG:  checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR:  unrecognized parameter "some_nonexistent_parameter"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT:  ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR:  RESET must not include values for parameters
2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT:  ALTER TABLESPACE testspace RESET (random_page_cost = 2.0);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR:  duplicate key value violates unique constraint "anindex"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL:  Key (column1)=(1) already exists.
2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT:  INSERT INTO testschema.atable VALUES(1);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR:  directory "/no/such/location" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT:  CREATE TABLESPACE badspace LOCATION '/no/such/location';
2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR:  tablespace "nosuchspace" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT:  CREATE TABLE bar (i int) TABLESPACE nosuchspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG:  checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG:  checkpoint complete: wrote 37 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR:  tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT:  DROP TABLESPACE testspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG:  checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG:  checkpoint complete: wrote 9 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR:  tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT:  DROP TABLESPACE testspace;


And I do see the tblspc file left after the finish of "make check":
tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Cheers,
S

P.S. I wouldn't be surprised if it is a compiler bug, though. But I did see the
failure with a newer icc as well.


*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Andrew Dunstan


On 01/15/2013 12:07 PM, Andrew Dunstan wrote:
>
> On 01/15/2013 11:57 AM, Tom Lane wrote:
>> Well, it could be quite reproducible, if for example what's happening is
>> that the DROP is failing to wait for the checkpointer at all.
>>
>> Is there a way to enable log_checkpoints during a buildfarm run?
>> It'd be good to get timestamps added to the postmaster log entries, too.
>
> Yes, it's very easy. In the config file, do something like:

I had a missing quote. It should be:

    extra_config =>
    {
        DEFAULT => [
            q(log_line_prefix = '%t [%c:%l] '),
            "log_checkpoints = 'true'",
        ],
    },

cheers


andrew




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Andrew Dunstan


On 01/15/2013 11:57 AM, Tom Lane wrote:
> Well, it could be quite reproducible, if for example what's happening is
> that the DROP is failing to wait for the checkpointer at all.
>
> Is there a way to enable log_checkpoints during a buildfarm run?
> It'd be good to get timestamps added to the postmaster log entries, too.

Yes, it's very easy. In the config file, do something like:

 extra_config =>
 {
 DEFAULT => [
  q(log_line_prefix = '%t [%c:%l] '),
  "log_checkpoints = 'true',
  ],
 },



cheers

andrew




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Tom Lane
Andres Freund  writes:
> Interestingly the compiler couldn't deduce that
> e.g. DateTimeParseError() didn't return with the old ereport definition
> but it can with the new one which seems strange.

Oooh, I hadn't noticed that.  I guess that must indicate that this
version of icc can deduce unreachability from

if (true)
abort();

but not from

(true) ? abort() : (void) 0;

which is a bit odd but not incredible.  (I had actually wondered while
working on the patch if this wording might be better for some compilers;
seems that's the case.)

What that means is that this compiler was not previously aware that
either ereport(ERROR) or elog(ERROR) doesn't return, but it now knows
that for both.  So that greatly expands the scope of places where
behavior might have changed.  Doesn't simplify our task :-(
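
The two spellings can be put side by side in a standalone sketch. Everything named here is an assumption made for illustration and not the actual ereport definition: the macros REPORT_STMT and REPORT_EXPR, the report() helper, and the level values are hypothetical. Both forms behave identically at run time; the difference discussed above is only whether a given compiler recognizes that control cannot continue past them when the level is a compile-time constant of ERROR or above.

```c
#include <stdlib.h>

/* Level values are arbitrary for this sketch. */
#define WARNING 19
#define ERROR   20

static int  report_calls = 0;

/* Hypothetical stand-in for the real reporting machinery. */
static void
report(int level)
{
    (void) level;
    report_calls++;
}

/* Statement form: "if (cond) abort();" */
#define REPORT_STMT(level) \
    do { \
        report(level); \
        if ((level) >= ERROR) \
            abort(); \
    } while (0)

/* Expression form: "cond ? abort() : (void) 0" */
#define REPORT_EXPR(level) \
    (report(level), ((level) >= ERROR) ? abort() : (void) 0)

/* Odd input is reported at ERROR level, so only even input reaches the return. */
static int
half_stmt(int x)
{
    if (x % 2 != 0)
        REPORT_STMT(ERROR);
    return x / 2;
}

static int
half_expr(int x)
{
    if (x % 2 != 0)
        REPORT_EXPR(ERROR);
    return x / 2;
}

/* Below ERROR both forms fall through; returns how many reports were made. */
static int
warn_then_continue(void)
{
    int         before = report_calls;

    REPORT_STMT(WARNING);
    REPORT_EXPR(WARNING);
    return report_calls - before;
}
```

A compiler that sees through only one of the forms will make different reachability assumptions for half_stmt() and half_expr(), which is the kind of change in analysis described above.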

>> I'm betting the answer is "none", and that what's happening is some sort
>> of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
>> checkpointer process to clean out all the just-dropped files).

> It seems strange tho that it started failing exactly with that commit
> and all runs failed exactly the same way since.

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

regards, tom lane




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Andres Freund
On 2013-01-15 11:19:28 -0500, Tom Lane wrote:
> Andres Freund  writes:
> >>> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> >>>> Another thing is that dugong has been reproducibly failing with
> >>>>
> >>>> drop cascades to table testschema.atable
> >>>> -- Should succeed
> >>>> DROP TABLESPACE testspace;
> >>>> + ERROR:  tablespace "testspace" is not empty
> >>>>
> >>>> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> >>>> some local problem there but I'm suspicious that there's a connection.
> >>>> But what?
> 
> > Do you have any idea what's going on? I don't really have any clue other than
> > guessing it might be a compiler bug.
> 
> I'm suspicious of that too, but it's hard to see why it would manifest
> like this while everything else appears to be fine.  A codegen bug
> triggered by elog ought to show up in a lot of places, one would think.

The make output showed that for some files optimizations were disabled by
the compiler because they were too complex. It's possible that it is
related to that :(.

Interestingly the compiler couldn't deduce that
e.g. DateTimeParseError() didn't return with the old ereport definition
but it can with the new one which seems strange.

> > Could the buildfarm owner perhaps tell us which files are left in the
> > tablespace 'testspace'?
> 
> I'm betting the answer is "none", and that what's happening is some sort
> of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
> checkpointer process to clean out all the just-dropped files).  The
> PANIC at shutdown looks like it might be some sort of doppelganger of
> the same bug, ie dropped table cleaned out too early.

It seems strange tho that it started failing exactly with that commit
and all runs failed exactly the same way since.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Andrew Dunstan


On 01/15/2013 11:04 AM, Andres Freund wrote:
> Could the buildfarm owner perhaps tell us which files are left in the
> tablespace 'testspace'?



They will not be able to easily - the workspace is normally cleared out 
at the end of each run.


cheers

andrew




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Tom Lane
Andres Freund  writes:
>>> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
>>>> Another thing is that dugong has been reproducibly failing with
>>>>
>>>> drop cascades to table testschema.atable
>>>> -- Should succeed
>>>> DROP TABLESPACE testspace;
>>>> + ERROR:  tablespace "testspace" is not empty
>>>>
>>>> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
>>>> some local problem there but I'm suspicious that there's a connection.
>>>> But what?

> Do you have any idea what's going on? I don't really have any clue other than
> guessing it might be a compiler bug.

I'm suspicious of that too, but it's hard to see why it would manifest
like this while everything else appears to be fine.  A codegen bug
triggered by elog ought to show up in a lot of places, one would think.

> Could the buildfarm owner perhaps tell us which files are left in the
> tablespace 'testspace'?

I'm betting the answer is "none", and that what's happening is some sort
of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
checkpointer process to clean out all the just-dropped files).  The
PANIC at shutdown looks like it might be some sort of doppelganger of
the same bug, ie dropped table cleaned out too early.

regards, tom lane




Re: [HACKERS] Curious buildfarm failures

2013-01-15 Thread Andres Freund
On 2013-01-14 22:56:47 +0100, Andres Freund wrote:
> On 2013-01-14 22:50:16 +0100, Andres Freund wrote:
> > On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> > > Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> > > a few buildfarm failures along the lines of
> > >
> > >   -- Commit table drop
> > >   COMMIT PREPARED 'regress-two';
> > > ! PANIC:  failed to re-find shared proclock object
> > > ! PANIC:  failed to re-find shared proclock object
> > > ! connection to server was lost
> > >
> > > Evidently I bollixed something, but what?  I've been unable to reproduce
> > > this locally so far.  Anybody see what's wrong?
> > >
> > > Another thing is that dugong has been reproducibly failing with
> > >
> > >  drop cascades to table testschema.atable
> > >   -- Should succeed
> > >   DROP TABLESPACE testspace;
> > > + ERROR:  tablespace "testspace" is not empty
> > >
> > > since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> > > some local problem there but I'm suspicious that there's a connection.
> > > But what?
> > >
> > > Any insights out there?
> >
> > It also has:
> >
> > FATAL:  could not open file "base/16384/28182": No such file or directory
> > CONTEXT:  writing block 6 of relation base/16384/28182
> > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 
> > 1743)
> 
> > #3  0x40b4b510 in ExceptionalCondition (
> > conditionName=0x40d76390 "!(PrivateRefCount[i] == 0)",
> > errorType=0x40d06500 "FailedAssertion",
> > fileName=0x40d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
> > #4  0x407a7d20 in AtProcExit_Buffers (code=1, arg=59) at 
> > bufmgr.c:1743
> > #5  0x407c4e50 in shmem_exit (code=1) at ipc.c:221
> >
> > in the log. So it seems like it also could be related to locking
> > changes although I don't immediately see why.
> 
> No such "luck" though, the animal only applied the elog commits:
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Do you have any idea what's going on? I don't really have any clue other than
guessing it might be a compiler bug.

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures

2013-01-14 Thread Tom Lane
Heikki Linnakangas  writes:
> The problem seems to arise when the old and the new key hash to the same
> bucket. In that case, hash_update_hash_key() tries to link the entry to
> itself. The attached patch fixes it for me.

Doh!  I had a feeling that that needed a special case, but didn't think
hard enough.  Thanks.

I think the patch could do with more than no comment, but will fix
that and apply.

regards, tom lane




Re: [HACKERS] Curious buildfarm failures

2013-01-14 Thread Heikki Linnakangas

On 15.01.2013 00:14, Heikki Linnakangas wrote:
> On 14.01.2013 23:35, Tom Lane wrote:
>> Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
>> a few buildfarm failures along the lines of
>>
>> -- Commit table drop
>> COMMIT PREPARED 'regress-two';
>> ! PANIC: failed to re-find shared proclock object
>> ! PANIC: failed to re-find shared proclock object
>> ! connection to server was lost
>>
>> Evidently I bollixed something, but what? I've been unable to reproduce
>> this locally so far. Anybody see what's wrong?
>
> I was able to reproduce this by setting max_locks_per_transaction and
> max_connections to the minimum. My assumption is that there's something
> wrong in the way hash_update_hash_key() handles collisions.

The problem seems to arise when the old and the new key hash to the same
bucket. In that case, hash_update_hash_key() tries to link the entry to
itself. The attached patch fixes it for me.


- Heikki
*** a/src/backend/utils/hash/dynahash.c
--- b/src/backend/utils/hash/dynahash.c
***************
*** 1022,1027 **** hash_update_hash_key(HTAB *hashp,
--- 1022,1028 ----
  	uint32		newhashvalue;
  	Size		keysize;
  	uint32		bucket;
+ 	uint32		newbucket;
  	long		segment_num;
  	long		segment_ndx;
  	HASHSEGMENT segp;
***************
*** 1078,1087 **** hash_update_hash_key(HTAB *hashp,
  	 */
  	newhashvalue = hashp->hash(newKeyPtr, hashp->keysize);
  
! 	bucket = calc_bucket(hctl, newhashvalue);
! 
! 	segment_num = bucket >> hashp->sshift;
! 	segment_ndx = MOD(bucket, hashp->ssize);
  
  	segp = hashp->dir[segment_num];
  
--- 1079,1087 ----
  	 */
  	newhashvalue = hashp->hash(newKeyPtr, hashp->keysize);
  
! 	newbucket = calc_bucket(hctl, newhashvalue);
! 	segment_num = newbucket >> hashp->sshift;
! 	segment_ndx = MOD(newbucket, hashp->ssize);
  
  	segp = hashp->dir[segment_num];
  
***************
*** 1115,1126 **** hash_update_hash_key(HTAB *hashp,
  
  	currBucket = existingElement;
  
! 	/* OK to remove record from old hash bucket's chain. */
! 	*oldPrevPtr = currBucket->link;
  
! 	/* link into new hashbucket chain */
! 	*prevBucketPtr = currBucket;
! 	currBucket->link = NULL;
  
  	/* copy new key into record */
  	currBucket->hashvalue = newhashvalue;
--- 1115,1129 ----
  
  	currBucket = existingElement;
  
! 	if (bucket != newbucket)
! 	{
! 		/* OK to remove record from old hash bucket's chain. */
! 		*oldPrevPtr = currBucket->link;
  
! 		/* link into new hashbucket chain */
! 		*prevBucketPtr = currBucket;
! 		currBucket->link = NULL;
! 	}
  
  	/* copy new key into record */
  	currBucket->hashvalue = newhashvalue;



Re: [HACKERS] Curious buildfarm failures

2013-01-14 Thread Heikki Linnakangas

On 14.01.2013 23:35, Tom Lane wrote:

Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

   -- Commit table drop
   COMMIT PREPARED 'regress-two';
! PANIC:  failed to re-find shared proclock object
! PANIC:  failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what?  I've been unable to reproduce
this locally so far.  Anybody see what's wrong?


I was able to reproduce this by setting max_locks_per_transaction and 
max_connections to the minimum. My assumption is that there's something 
wrong in the way hash_update_hash_key() handles collisions.


- Heikki




Re: [HACKERS] Curious buildfarm failures

2013-01-14 Thread Andres Freund
On 2013-01-14 22:50:16 +0100, Andres Freund wrote:
> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> > Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> > a few buildfarm failures along the lines of
> >
> >   -- Commit table drop
> >   COMMIT PREPARED 'regress-two';
> > ! PANIC:  failed to re-find shared proclock object
> > ! PANIC:  failed to re-find shared proclock object
> > ! connection to server was lost
> >
> > Evidently I bollixed something, but what?  I've been unable to reproduce
> > this locally so far.  Anybody see what's wrong?
> >
> > Another thing is that dugong has been reproducibly failing with
> >
> >  drop cascades to table testschema.atable
> >   -- Should succeed
> >   DROP TABLESPACE testspace;
> > + ERROR:  tablespace "testspace" is not empty
> >
> > since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> > some local problem there but I'm suspicious that there's a connection.
> > But what?
> >
> > Any insights out there?
>
> It also has:
>
> FATAL:  could not open file "base/16384/28182": No such file or directory
> CONTEXT:  writing block 6 of relation base/16384/28182
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)

> #3  0x40b4b510 in ExceptionalCondition (
> conditionName=0x40d76390 "!(PrivateRefCount[i] == 0)",
> errorType=0x40d06500 "FailedAssertion",
> fileName=0x40d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
> #4  0x407a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
> #5  0x407c4e50 in shmem_exit (code=1) at ipc.c:221
>
> in the log. So it seems like it also could be related to locking
> changes although I don't immediately see why.

No such "luck" though, the animal only applied the elog commits:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Curious buildfarm failures

2013-01-14 Thread Andres Freund
On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> a few buildfarm failures along the lines of
>   
>   -- Commit table drop
>   COMMIT PREPARED 'regress-two';
> ! PANIC:  failed to re-find shared proclock object
> ! PANIC:  failed to re-find shared proclock object
> ! connection to server was lost
> 
> Evidently I bollixed something, but what?  I've been unable to reproduce
> this locally so far.  Anybody see what's wrong?
> 
> Another thing is that dugong has been reproducibly failing with
> 
>  drop cascades to table testschema.atable
>   -- Should succeed
>   DROP TABLESPACE testspace;
> + ERROR:  tablespace "testspace" is not empty
> 
> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> some local problem there but I'm suspicious that there's a connection.
> But what?
> 
> Any insights out there?

It also has:

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
FATAL:  could not open file "base/16384/28182": No such file or directory
CONTEXT:  writing block 6 of relation base/16384/28182
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)
LOG:  checkpointer process (PID 30366) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
LOG:  abnormal database system shutdown


== stack trace: pgsql.9958/src/test/regress/tmp_check/data/core ==
Using host libthread_db library "/lib/tls/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Core was generated by `postgres: checkpointer process    '.
Program terminated with signal 6, Aborted.

#0  0xa0010620 in __kernel_syscall_via_break ()
#0  0xa0010620 in __kernel_syscall_via_break ()
#1  0x20953bb0 in raise () from /lib/tls/libc.so.6.1
#2  0x20956df0 in abort () from /lib/tls/libc.so.6.1
#3  0x40b4b510 in ExceptionalCondition (
conditionName=0x40d76390 "!(PrivateRefCount[i] == 0)", 
errorType=0x40d06500 "FailedAssertion", 
fileName=0x40d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
#4  0x407a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
#5  0x407c4e50 in shmem_exit (code=1) at ipc.c:221
#6  0x407c4fa0 in proc_exit_prepare (code=1) at ipc.c:181
#7  0x407c4ab0 in proc_exit (code=1) at ipc.c:96
#8  0x40b5d390 in errfinish (dummy=0) at elog.c:518
#9  0x40823380 in _mdfd_getseg (reln=0x60155420, 
forknum=1397792, blkno=6, skipFsync=0 '\0', behavior=EXTENSION_FAIL)
at md.c:577
#10 0x4081e5c0 in mdwrite (reln=0x60155420, 
forknum=MAIN_FORKNUM, blocknum=6, buffer=0x21432ea0 "", 
skipFsync=0 '\0') at md.c:735
#11 0x40824690 in smgrwrite (reln=0x60155420, 
forknum=MAIN_FORKNUM, blocknum=6, buffer=0x21432ea0 "", 
skipFsync=0 '\0') at smgr.c:534
#12 0x4079e510 in FlushBuffer (buf=0x1, reln=0x60155420)
at bufmgr.c:1941
#13 0x407a10b0 in SyncOneBuffer (buf_id=0, skip_recently_used=0 '\0')
at bufmgr.c:1677
#14 0x407a0c00 in CheckPointBuffers (flags=5) at bufmgr.c:1284
#15 0x401fcbf0 in CheckPointGuts (checkPointRedo=80827000, flags=5)
at xlog.c:7391
#16 0x401fb2a0 in CreateCheckPoint (flags=5) at xlog.c:7240
#17 0x401f6820 in ShutdownXLOG (code=14699520, 
arg=4611686018440093920) at xlog.c:6823
#18 0x4072d780 in _setjmp_lpad_CheckpointerMain_0$0$18 ()
at checkpointer.c:413
#19 0x40235810 in AuxiliaryProcessMain (argc=496536, 
argv=0x6f80e520) at bootstrap.c:433
#20 0x407172b0 in StartChildProcess (type=508288) at postmaster.c:4956
#21 0x40713f50 in reaper (postgres_signal_arg=30365)
at postmaster.c:2568
#22 <signal handler called>
#23 0xa0010620 in __kernel_syscall_via_break ()
#24 0x20953f70 in sigprocmask () from /lib/tls/libc.so.6.1
#25 0x40720480 in ServerLoop () at postmaster.c:1521
#26 0x4071d9d0 in PostmasterMain (argc=6, argv=0x600d85e0)
at postmaster.c:1244
#27 0x40577a30 in main (argc=6, argv=0x600d8010) at main.c:197

in the log. So it seems like it also could be related to locking
changes although I don't immediately see why.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Curious buildfarm failures

2013-01-14 Thread Tom Lane
Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of
  
  -- Commit table drop
  COMMIT PREPARED 'regress-two';
! PANIC:  failed to re-find shared proclock object
! PANIC:  failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what?  I've been unable to reproduce
this locally so far.  Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with

 drop cascades to table testschema.atable
  -- Should succeed
  DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty

since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?

regards, tom lane

