On Thursday, May 20, 2021 9:59 PM Amit Langote <[email protected]> wrote:
> Here are updated/divided patches.
Thanks for your updates.
However, I've detected segmentation faults caused by the patch, which can
happen during 100_bugs.pl in src/test/subscription.
This happens in more than one run out of ten.
The problem looks like a timing issue and was already introduced by v3.
I also reproduced the failure with v5 applied on HEAD, while unpatched HEAD
doesn't reproduce it even when I executed 100_bugs.pl 200 times in a tight loop.
I used the same base commit (4f586fe2) for all checks. The logs below are
from v3.
* The failure message from the TAP test:
# Postmaster PID for node "twoways" is 5015
Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways
# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# psql: error: connection to server on socket "/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory
# Is the server running locally and accepting connections on that socket?
timed out waiting for catchup at t/100_bugs.pl line 148.
The failure produces a core file; its backtrace is below.
My first guess at the cause is that an invalidation message arrives between
the point where the entry is obtained from hash_search() in
get_rel_sync_entry() and the point where the map is set by
convert_tuples_by_name() in maybe_send_schema(), so the invalidation callback
tries to free tuple descriptors that have never been set in the entry.
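To spell out the sequence I suspect (just a sketch; apart from the line quoted
in frame #0, the identifiers come from my reading of the patch and may not
match it exactly):

/* In get_rel_sync_entry(), simplified */
entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                          (void *) &relid,
                                          HASH_ENTER, &found);
/*
 * Window: for a new entry, entry->map has not been initialized yet.
 * maybe_send_schema() only sets it later via convert_tuples_by_name().
 * If an invalidation message is executed inside this window, the
 * callback below runs against the unset field.
 */

/* In rel_sync_cache_relation_cb(), per frame #0 of the backtrace */
FreeTupleDesc(entry->map->indesc);	/* pgoutput.c:1225, dereferences garbage */

That would also fit the intermittent, timing-dependent nature of the failure.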
* Core file backtrace:
Core was generated by `postgres: twoways: walsender k5user [local] START_REPLICATION '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
1225            FreeTupleDesc(entry->map->indesc);
(gdb) bt
#0  0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
#1  0x0000000000ae21f0 in LocalExecuteInvalidationMessage (msg=0x21d1de8) at inval.c:601
#2  0x00000000008dbd6e in ReorderBufferExecuteInvalidations (nmsgs=4, msgs=0x21d1db8) at reorderbuffer.c:3232
#3  0x00000000008da70a in ReorderBufferProcessTXN (rb=0x21d1a40, txn=0x2210b58, commit_lsn=25569096, snapshot_now=0x21d1e10, command_id=1, streaming=false) at reorderbuffer.c:2294
#4  0x00000000008dae56 in ReorderBufferReplay (txn=0x2210b58, rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#5  0x00000000008daede in ReorderBufferCommit (rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2615
#6  0x00000000008cae06 in DecodeCommit (ctx=0x21e6880, buf=0x7fffb9383db0, parsed=0x7fffb9383c10, xid=748, two_phase=false) at decode.c:744
#7  0x00000000008ca1ed in DecodeXactOp (ctx=0x21e6880, buf=0x7fffb9383db0) at decode.c:278
#8  0x00000000008c9e76 in LogicalDecodingProcessRecord (ctx=0x21e6880, record=0x21e6c80) at decode.c:142
#9  0x0000000000901fcc in XLogSendLogical () at walsender.c:2876
#10 0x0000000000901327 in WalSndLoop (send_data=0x901f30 <XLogSendLogical>) at walsender.c:2306
#11 0x00000000008ffd5f in StartLogicalReplication (cmd=0x219aff8) at walsender.c:1206
#12 0x00000000009006ae in exec_replication_command (cmd_string=0x2123c20 "START_REPLICATION SLOT \"pg_16400_sync_16392_6964617299612181363\" LOGICAL 0/182D058 (proto_version '2', publication_names '\"testpub\"')") at walsender.c:1646
#13 0x000000000096ffae in PostgresMain (argc=1, argv=0x7fffb93840d0, dbname=0x214ef18 "d1", username=0x214eef8 "k5user") at postgres.c:4482
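If the guess above is right, initializing the map pointer as soon as the entry
is created and guarding the free should close the window. A rough, untested
sketch (again based on my reading of the patch, so the surrounding code may
differ):

/* In get_rel_sync_entry(), right after HASH_ENTER returns a new entry */
if (!found)
	entry->map = NULL;	/* make the entry safe for invalidation callbacks */

/* In rel_sync_cache_relation_cb() */
if (entry->map != NULL)
{
	FreeTupleDesc(entry->map->indesc);
	FreeTupleDesc(entry->map->outdesc);
	free_conversion_map(entry->map);	/* frees the map itself, not the descs */
}
entry->map = NULL;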
I'll update when I get more information.
Best Regards,
Takamichi Osumi