[
https://issues.apache.org/jira/browse/HAWQ-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lin Wen updated HAWQ-1284:
--------------------------
Attachment: hawq-2017-01-17_054054.csv
master's log file
> HAWQ master is coredump when kill all process on master and standby
> -------------------------------------------------------------------
>
> Key: HAWQ-1284
> URL: https://issues.apache.org/jira/browse/HAWQ-1284
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Lin Wen
> Assignee: Ed Espino
> Attachments: hawq-2017-01-17_054054.csv
>
>
> When the HAWQ cluster is running (with no active queries), killing all postgres
> processes on the master (with the command "killall postgres") and then killing
> all processes on the standby (with the command "killall gpsyncmaster") randomly
> causes the HAWQ master to generate a core dump.
> The call stack is:
> #0 0x00000032214325e5 in raise () from /lib64/libc.so.6
> #1 0x0000003221433dc5 in abort () from /lib64/libc.so.6
> #2 0x00000000008cce7f in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:686
> #3 0x00000000008cf032 in elog_finish (elevel=Unhandled dwarf expression opcode 0xf3) at elog.c:1463
> #4 0x00000000007d4912 in proc_exit_prepare (code=1) at ipc.c:153
> #5 0x00000000007d4a38 in proc_exit (code=1) at ipc.c:93
> #6 0x00000000008ccc7e in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:670
> #7 0x000000000078dea1 in ServiceDoConnect (listenerPort=64556, complain=Unhandled dwarf expression opcode 0xf3) at service.c:165
> #8 0x00000000004efd5a in XLogQDMirrorWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:1981
> #9 XLogWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:2354
> #10 0x00000000004f2242 in XLogFlush (record=...) at xlog.c:2572
> #11 0x00000000004f7288 in CreateCheckPoint (shutdown=Unhandled dwarf expression opcode 0xf3) at xlog.c:8136
> #12 0x00000000004f9f72 in ShutdownXLOG (code=Unhandled dwarf expression opcode 0xf3) at xlog.c:7865
> #13 0x000000000078b2b0 in BackgroundWriterMain () at bgwriter.c:318
> #14 0x000000000055a870 in AuxiliaryProcessMain (argc=<value optimized out>, argv=0x7fff02330850) at bootstrap.c:467
> #15 0x000000000079b4f0 in StartChildProcess (type=Unhandled dwarf expression opcode 0xf3) at postmaster.c:6836
> #16 0x000000000079b7aa in CommenceNormalOperations () at postmaster.c:3618
> #17 0x000000000079fee4 in do_reaper () at postmaster.c:3831
> #18 ServerLoop () at postmaster.c:2136
> #19 0x00000000007a2179 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3) at postmaster.c:1454
> #20 0x00000000004a4f99 in main (argc=9, argv=0x2a4f010) at main.c:226
> The reason is that the "WAL Send Server" process is killed first. When the writer
> process receives a shutdown request, it begins to create a checkpoint and sync
> the xlog to the standby master; however, by that point the WAL send server
> process has already been killed. The writer process therefore fails to connect
> to the WAL send server process and reports an ERROR:
>     ereport(ERROR,
>             (errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
>              errmsg("Could not connect to '%s': %s",
>                     serviceConfig->title,
>                     strerror(saved_err))));
> (service.c, line 165)
> From the call stack we can see that when ereport() is called, proc_exit_prepare()
> gets called. At line 152 of ipc.c, CritSectionCount is greater than 0, so a PANIC
> is raised and a core dump is generated. CritSectionCount is incremented when the
> writer process calls XLogFlush().
>     if (CritSectionCount > 0)
>         elog(PANIC, "process is dying from critical section");
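> Below is a minimal, standalone sketch (not HAWQ source; the function bodies are
> simplified stand-ins) that mimics this interaction: the write to the standby runs
> inside a critical section, the failed connect turns into a process exit, and
> exiting while CritSectionCount > 0 becomes a PANIC.
>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     /* Simplified stand-ins for the PostgreSQL/HAWQ globals and macros. */
>     static int CritSectionCount = 0;
>
>     #define START_CRIT_SECTION() (CritSectionCount++)
>     #define END_CRIT_SECTION()   (CritSectionCount--)
>
>     /* Mimics ipc.c: exiting while still inside a critical section is a PANIC. */
>     static void proc_exit_prepare(int code)
>     {
>         if (CritSectionCount > 0)
>         {
>             fprintf(stderr, "PANIC: process is dying from critical section\n");
>             abort();            /* this abort() is the source of the core dump */
>         }
>         exit(code);
>     }
>
>     /* Mimics the failing connect in service.c: the ERROR ends the process. */
>     static void ServiceDoConnect(void)
>     {
>         fprintf(stderr, "ERROR: Could not connect to 'WAL Send Server'\n");
>         proc_exit_prepare(1);   /* the ereport(ERROR) path lands here */
>     }
>
>     /* Mimics XLogFlush(): the write to the standby is in a critical section. */
>     static void XLogFlush(void)
>     {
>         START_CRIT_SECTION();
>         ServiceDoConnect();     /* fails because the WAL send server was killed */
>         END_CRIT_SECTION();     /* never reached */
>     }
>
>     int main(void)
>     {
>         XLogFlush();
>         return 0;
>     }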
>
> A possible solution is to check whether the WAL send server process still exists
> before the writer process writes the log to the standby. If it does not exist,
> do not call WalSendServerClientConnect() to connect to the WAL send server
> process.
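> A minimal sketch of that check is below. It is only an illustration under
> assumptions: WalSendServerGetPid() is a hypothetical helper (stubbed here so the
> snippet compiles on its own; in a real patch the PID would have to come from the
> postmaster, e.g. via shared memory), and kill(pid, 0) is used purely to probe
> whether the process still exists.
>
>     #include <errno.h>
>     #include <signal.h>
>     #include <stdbool.h>
>     #include <stdio.h>
>     #include <sys/types.h>
>     #include <unistd.h>
>
>     /* Hypothetical helper: stubbed with getpid() so the sketch is runnable;
>      * how HAWQ would actually track this PID is left open here. */
>     static pid_t WalSendServerGetPid(void)
>     {
>         return getpid();
>     }
>
>     /* kill() with signal 0 delivers nothing; it only reports whether the
>      * target process exists (EPERM still means "exists"). */
>     static bool WalSendServerProcessExists(void)
>     {
>         pid_t pid = WalSendServerGetPid();
>
>         if (pid <= 0)
>             return false;
>         return (kill(pid, 0) == 0 || errno == EPERM);
>     }
>
>     int main(void)
>     {
>         /* In XLogQDMirrorWrite() the guard would look roughly like:
>          *
>          *     if (WalSendServerProcessExists())
>          *         WalSendServerClientConnect(...);
>          *     else
>          *         skip the mirror write instead of raising an ERROR
>          *         inside the critical section
>          */
>         printf("WAL send server alive: %s\n",
>                WalSendServerProcessExists() ? "yes" : "no");
>         return 0;
>     }
>
> With such a guard the writer process never enters ServiceDoConnect() once the
> peer is gone, so no ERROR is raised while CritSectionCount is still positive.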
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)