[
https://issues.apache.org/jira/browse/HAWQ-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lin Wen updated HAWQ-1284:
--------------------------
Attachment: hawq-2017-01-17_054054.csv
master's log file
> HAWQ master is coredump when kill all process on master and standby
> -------------------------------------------------------------------
>
> Key: HAWQ-1284
> URL: https://issues.apache.org/jira/browse/HAWQ-1284
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Lin Wen
> Assignee: Ed Espino
> Attachments: hawq-2017-01-17_054054.csv
>
>
> When the HAWQ cluster is running (with no active queries), killing all postgres
> processes on the master (with the command "killall postgres") and then killing
> all processes on the standby (with the command "killall gpsyncmaster") randomly
> causes the HAWQ master to generate a core dump.
> The call stack is:
> #0 0x00000032214325e5 in raise () from /lib64/libc.so.6
> #1 0x0000003221433dc5 in abort () from /lib64/libc.so.6
> #2 0x00000000008cce7f in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:686
> #3 0x00000000008cf032 in elog_finish (elevel=Unhandled dwarf expression opcode 0xf3) at elog.c:1463
> #4 0x00000000007d4912 in proc_exit_prepare (code=1) at ipc.c:153
> #5 0x00000000007d4a38 in proc_exit (code=1) at ipc.c:93
> #6 0x00000000008ccc7e in errfinish (dummy=Unhandled dwarf expression opcode 0xf3) at elog.c:670
> #7 0x000000000078dea1 in ServiceDoConnect (listenerPort=64556, complain=Unhandled dwarf expression opcode 0xf3) at service.c:165
> #8 0x00000000004efd5a in XLogQDMirrorWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:1981
> #9 XLogWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:2354
> #10 0x00000000004f2242 in XLogFlush (record=...) at xlog.c:2572
> #11 0x00000000004f7288 in CreateCheckPoint (shutdown=Unhandled dwarf expression opcode 0xf3) at xlog.c:8136
> #12 0x00000000004f9f72 in ShutdownXLOG (code=Unhandled dwarf expression opcode 0xf3) at xlog.c:7865
> #13 0x000000000078b2b0 in BackgroundWriterMain () at bgwriter.c:318
> #14 0x000000000055a870 in AuxiliaryProcessMain (argc=<value optimized out>, argv=0x7fff02330850) at bootstrap.c:467
> #15 0x000000000079b4f0 in StartChildProcess (type=Unhandled dwarf expression opcode 0xf3) at postmaster.c:6836
> #16 0x000000000079b7aa in CommenceNormalOperations () at postmaster.c:3618
> #17 0x000000000079fee4 in do_reaper () at postmaster.c:3831
> #18 ServerLoop () at postmaster.c:2136
> #19 0x00000000007a2179 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3) at postmaster.c:1454
> #20 0x00000000004a4f99 in main (argc=9, argv=0x2a4f010) at main.c:226
> The reason is that the "WAL Send Server" process is killed first. When the writer
> process receives a shutdown request, it begins to create a checkpoint and sync
> the xlog to the standby master; however, by that point the WAL send server
> process has already been killed. The writer process therefore fails to connect
> to the WAL send server process and reports an ERROR:
>     ereport(ERROR,
>             (errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
>              errmsg("Could not connect to '%s': %s",
>                     serviceConfig->title,
>                     strerror(saved_err))));
> (service.c, line 165)
> From the call stack we can see that when ereport() is called, proc_exit_prepare()
> gets called. At line 152 of ipc.c, CritSectionCount is greater than 0, so a PANIC
> is raised and a core dump is generated. CritSectionCount is incremented when the
> writer process calls XLogFlush().
>     if (CritSectionCount > 0)
>         elog(PANIC, "process is dying from critical section");
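> Below is a minimal, standalone sketch (not HAWQ source; the function bodies are
> simplified stand-ins) that mimics this interaction: the write to the standby runs
> inside a critical section, the failed connect turns into a process exit, and
> exiting while CritSectionCount > 0 becomes a PANIC.
>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     /* Simplified stand-ins for the PostgreSQL/HAWQ globals and macros. */
>     static int CritSectionCount = 0;
>
>     #define START_CRIT_SECTION() (CritSectionCount++)
>     #define END_CRIT_SECTION()   (CritSectionCount--)
>
>     /* Mimics ipc.c: exiting while still inside a critical section is a PANIC. */
>     static void proc_exit_prepare(int code)
>     {
>         if (CritSectionCount > 0)
>         {
>             fprintf(stderr, "PANIC: process is dying from critical section\n");
>             abort();            /* this abort() is the source of the core dump */
>         }
>         exit(code);
>     }
>
>     /* Mimics the failing connect in service.c: the ERROR ends the process. */
>     static void ServiceDoConnect(void)
>     {
>         fprintf(stderr, "ERROR: Could not connect to 'WAL Send Server'\n");
>         proc_exit_prepare(1);   /* the ereport(ERROR) path lands here */
>     }
>
>     /* Mimics XLogFlush(): the write to the standby is in a critical section. */
>     static void XLogFlush(void)
>     {
>         START_CRIT_SECTION();
>         ServiceDoConnect();     /* fails because the WAL send server was killed */
>         END_CRIT_SECTION();     /* never reached */
>     }
>
>     int main(void)
>     {
>         XLogFlush();
>         return 0;
>     }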
>
> A possible solution is to check whether the WAL send server process still exists
> before the writer process writes the log to the standby. If it does not exist,
> do not call WalSendServerClientConnect() to connect to the WAL send server
> process.
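> A minimal sketch of that check is below. It is only an illustration under
> assumptions: WalSendServerGetPid() is a hypothetical helper (stubbed here so the
> snippet compiles on its own; in a real patch the PID would have to come from the
> postmaster, e.g. via shared memory), and kill(pid, 0) is used purely to probe
> whether the process still exists.
>
>     #include <errno.h>
>     #include <signal.h>
>     #include <stdbool.h>
>     #include <stdio.h>
>     #include <sys/types.h>
>     #include <unistd.h>
>
>     /* Hypothetical helper: stubbed with getpid() so the sketch is runnable;
>      * how HAWQ would actually track this PID is left open here. */
>     static pid_t WalSendServerGetPid(void)
>     {
>         return getpid();
>     }
>
>     /* kill() with signal 0 delivers nothing; it only reports whether the
>      * target process exists (EPERM still means "exists"). */
>     static bool WalSendServerProcessExists(void)
>     {
>         pid_t pid = WalSendServerGetPid();
>
>         if (pid <= 0)
>             return false;
>         return (kill(pid, 0) == 0 || errno == EPERM);
>     }
>
>     int main(void)
>     {
>         /* In XLogQDMirrorWrite() the guard would look roughly like:
>          *
>          *     if (WalSendServerProcessExists())
>          *         WalSendServerClientConnect(...);
>          *     else
>          *         skip the mirror write instead of raising an ERROR
>          *         inside the critical section
>          */
>         printf("WAL send server alive: %s\n",
>                WalSendServerProcessExists() ? "yes" : "no");
>         return 0;
>     }
>
> With such a guard the writer process never enters ServiceDoConnect() once the
> peer is gone, so no ERROR is raised while CritSectionCount is still positive.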
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)