On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0  0x0000003969632625 in *__GI_raise (sig=6) at
#1  0x0000003969633e05 in *__GI_abort () at abort.c:92
#2  0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
    lineNumber=345) at assert.c:54
#3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4  0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5  0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6  0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7  0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8  0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9  0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
    numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
                      QUERY PLAN
   Sort Key: ((ptab.tableoid)::regclass)
   ->  HashAggregate
         Group Key: (ptab.tableoid)::regclass, ptab.a
         ->  Result
               ->  Append
                     ->  Foreign Scan on ptab_00001
                     ->  Foreign Scan on ptab_00002
                     ->  Foreign Scan on ptab_00003
                     ->  Foreign Scan on ptab_00004
                     ->  Foreign Scan on ptab_00005
                     ->  Foreign Scan on ptab_00006
                     ->  Foreign Scan on ptab_00007
                     ->  Foreign Scan on ptab_00008
                     ->  Foreign Scan on ptab_00009
                     ->  Foreign Scan on ptab_00010

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are handled during resource-owner-mediated cleanup of a failed
query, for example after the FDW returns an error like the one below:

ERROR:  relation "public.ptab_00010" does not exist
CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this case looks like this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301                                     lastidx = resarr->lastidx;
(gdb) bt
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1  0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3  0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
    at resowner.c:566
#4  0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5  0x0000000000524172 in AbortTransaction () at xact.c:2588
#6  0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7  0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8  0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9  0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

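The segfault happens because the WaitEventSet passed to FreeWaitEventSet()
(the events argument in frame #1) seems to have been pfree'd already, so
set->resowner is garbage by the time ResourceOwnerForgetWES() uses it: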

(gdb) f 2
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600                     ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
  nevents = 2139062143,
  nevents_space = 2139062143,
  resowner = 0x7f7f7f7f7f7f7f7f,
  events = 0x7f7f7f7f7f7f7f7f,
  latch = 0x7f7f7f7f7f7f7f7f,
  latch_pos = 2139062143,
  epoll_fd = 2139062143,
  epoll_ret_events = 0x7f7f7f7f7f7f7f7f


