On 11 April 2018 at 18:58, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 10 April 2018 at 08:55, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
>>> David Rowley wrote:
>>>> Okay, I've written and attached a fix for this.  I'm not 100% certain
>>>> that this is the cause of the problem on pademelon, but the code does
>>>> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
>>>> happy, if not I'll think a bit harder about what might be causing that
>>>> instability.
>>
>>> Pushed it just now.  Let's see what happens with pademelon now.
>>
>> I've had pademelon's host running a "make installcheck" loop all day
>> trying to reproduce the problem.  I haven't gotten a bite yet (although
>> at 15+ minutes per cycle, this isn't a huge number of tests).  I think
>> we were remarkably (un)lucky to see the problem so quickly after the
>> initial commit, and I'm afraid pademelon isn't going to help us prove
>> much about whether this was the same issue.
>>
>> This does remind me quite a bit though of the ongoing saga with the
>> postgres_fdw test instability.  Given the frequency with which that's
>> failing in the buildfarm, you would not think it's impossible to
>> reproduce outside the buildfarm, and yet I'm here to tell you that
>> it's pretty damn hard.  I haven't succeeded yet, and that's not for
>> lack of trying.  Could there be something about the buildfarm
>> environment that makes these sorts of things more likely?
>
> coypu just demonstrated that this was not the cause of the problem [1]
>
> I'll study the code a bit more and see if I can think why this might
> be happening.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

I've spent a bit of time tonight trying to dig into this problem to
see if I can figure out what's going on.

I ended up running the following script on both a Linux x86_64 machine
and also a power8 machine.

#!/bin/bash
for x in {1..1000}
do
    echo "$x";
    for i in {1..1000}
    do
        psql -d postgres -f test.sql -o test.out
        diff -u test.out test.expect
    done
done

I was unable to recreate this problem after about 700k loops on the
Linux machine and 130k loops on the power8.
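
For anyone rerunning this on a buildfarm host, a variant that stops at
the first mismatch may be more convenient than scrolling back through
the loop output. The following is only a sketch: repeat_until_diff is a
hypothetical helper name, not something in the tree, and it assumes the
command under test rewrites the "actual" file on every run, as
"psql ... -o test.out" does in the loop above.

```shell
#!/bin/bash
# Hypothetical helper (not part of PostgreSQL): rerun a command until the
# file it writes differs from the expected file.
# usage: repeat_until_diff MAX_RUNS EXPECTED_FILE ACTUAL_FILE COMMAND [ARGS...]
repeat_until_diff() {
    local max_runs=$1 expected=$2 actual=$3
    shift 3
    local run
    for ((run = 1; run <= max_runs; run++)); do
        # Remove stale output so that a command which fails to write
        # anything is reported as a mismatch rather than silently passing.
        rm -f -- "$actual"
        "$@"
        # Same argument order as the original loop: actual, then expected.
        if ! diff -u "$actual" "$expected"; then
            echo "mismatch on run $run; output left in $actual" >&2
            return 1
        fi
    done
    echo "no mismatch in $max_runs runs"
    return 0
}
```

With the files from this message it could be called as:

    repeat_until_diff 1000000 test.expect test.out \
        psql -d postgres -f test.sql -o test.out

The failing test.out is left in place so it can be compared against the
buildfarm diffs afterwards.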

I've emailed the owner of coypu to ask if it would be possible to get
access to the machine, or have him run the script to see if it does
actually fail. Currently waiting to hear back.

I've made another pass over the nodeAppend.c code and I'm unable to
see what might cause this, although I did discover a bug where
first_partial_plan is set without taking into account that some
subplans may have been pruned away during executor init. I think the
only effect of this is that parallel workers may not properly help out
with some partial plans if some earlier subplans were pruned. I can see
no reason for this to have caused this particular issue, since
first_partial_plan would be 0 with and without the attached fix.

Tom, would there be any chance you could run the above script for a
while on pademelon to see if it can in fact reproduce the problem?
coypu did show this problem in the install check, so I don't think it
will need the other concurrent tests to fail.  If you can recreate,
after adjusting the expected output, does the problem still exist in
5c0675215?

I also checked which other tests perform an EXPLAIN ANALYZE of a plan
with a Parallel Append, and I see there are none. So I've not ruled out
that this is an existing bug. git grep "explain.*analyze" also does
not show much outside of the partition_prune tests.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
set enable_indexonlyscan = off;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;

prepare ab_q5 (int, int, int) as
select avg(a) from ab where a in($1,$2,$3) and b < 4;

execute ab_q5 (1, 2, 3);
execute ab_q5 (1, 2, 3);
execute ab_q5 (1, 2, 3);
execute ab_q5 (1, 2, 3);
execute ab_q5 (1, 2, 3);

explain (analyze, costs off, summary off, timing off) execute ab_q5 (2, 3, 3);

Attachment: test.expect
Description: Binary data

Attachment: first_partial_plan_fix.patch
Description: Binary data

create table ab (a int not null, b int not null) partition by list (a);
create table ab_a2 partition of ab for values in(2) partition by list (b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);
create table ab_a1 partition of ab for values in(1) partition by list (b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in(3) partition by list (b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);
