[ https://issues.apache.org/jira/browse/HAWQ-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206036#comment-15206036 ]
Ming LI commented on HAWQ-575:
------------------------------

The log is below:

2016-03-19 23:44:54.621653 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p1372,th-1217652544,"172.28.8.250","15627",2016-03-19 22:52:05 PDT,1172392,con92730,cmd2,seg97,slice5,,x1172392,sx1,"ERROR","58030","could not read from temporary file: Input/output error",,,,,,"select ...
2016-03-19 23:44:54.649381 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p797594,th-1217652544,"172.28.8.250","14688",2016-03-19 22:48:14 PDT,1172286,con92501,cmd2,seg101,slice7,,x1172286,sx1,"FATAL","08006","connection to client lost",,,,,,,0,,"postgres.c",3512,
2016-03-19 23:44:54.675656 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19 22:53:40 PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"ERROR","58030","could not read from temporary file: Input/output error",,,,,,"select nation, o_year, sum(amount) as sum_profit from ( select n_name as nation, extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount from part, supplier, lineitem, partsupp, orders, nation where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and ps_partkey = l_partkey and p_partkey = l_partkey and o_orderkey = l_orderkey and s_nationkey = n_nationkey and p_name like '%aquamarine%' ) as profit group by nation, o_year order by nation, o_year desc;",0,,"compress_nothing.c",61,
2016-03-19 23:44:54.683709 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19 22:53:40 PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"ERROR","58030","could not close temporary file /data21/tmp/pgsql_tmp/workfile_set_HashJoin_Slice5.XXXXSzweO6/spillfile_f95: Input/output error",,,,,,,0,,"bfz.c",466,
2016-03-19 23:44:54.689898 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19 22:53:40 PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"WARNING","58030","could not close temporary file /data21/tmp/pgsql_tmp/workfile_set_HashJoin_Slice5.XXXXSzweO6/spillfile_f123: Input/output error",,,,,,,0,,"bfz.c",466,
2016-03-19 23:45:08.582441 PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19 22:53:40 PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"PANIC","XX000","Resume interrupt holdoff count is bad (0) (xact.c:2907)",,,,,,,0,,"xact.c",2907,"Stack trace:
1 0x871f7f postgres <symbol not found> + 0x871f7f
2 0x872659 postgres elog_finish + 0xa9
3 0x4e171b postgres AbortTransaction + 0x7cb
4 0x4e2c45 postgres AbortCurrentTransaction + 0x25
5 0x7b01ea postgres PostgresMain + 0xaba
6 0x763c03 postgres <symbol not found> + 0x763c03
7 0x76435d postgres <symbol not found> + 0x76435d
8 0x76618e postgres PostmasterMain + 0xc7e
9 0x6c028a postgres main + 0x48a
10 0x33d401ed1d libc.so.6 __libc_start_main + 0xfd
11 0x4a17e9 postgres <symbol not found> + 0x4a17e9

From the log above, the root cause is:
1) con92730,cmd2,seg97,slice5 reported: could not read from temporary file: Input/output error
2) So the transaction will be aborted. The master node will send SIGQUIT to all processes on the segments, and they will quit.
3) con92501,cmd2,seg101,slice7: before processing SIGQUIT, it first detected that the connection to the QD had failed, so it reported FATAL.
4) con92825,cmd2,seg97,slice5: why are two errors reported here? Maybe the second error is raised inside AbortTransaction(), which would set InterruptHoldoffCount to 0.

> QE core dumped when report "Resume interrupt holdoff count is bad (0)
> (xact.c:2907)"
> ------------------------------------------------------------------------------------
>
> Key: HAWQ-575
> URL: https://issues.apache.org/jira/browse/HAWQ-575
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Ming LI
> Assignee: Lei Chang
>
> Core was generated by `postgres: port 5532, gpadmin tpch_row_2...
> 172.28.8.250(18226) con92825 seg97'.
> Program terminated with signal 6, Aborted.
> #0 0x00000033d4032925 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install hawq-2.0.0.0_beta-20925.x86_64
> (gdb) bt
> #0 0x00000033d4032925 in raise () from /lib64/libc.so.6
> #1 0x00000033d4034105 in abort () from /lib64/libc.so.6
> #2 0x0000000000871c6e in errfinish (dummy=<value optimized out>) at elog.c:682
> #3 0x00000000008727bb in elog_finish (elevel=<value optimized out>, fmt=<value optimized out>) at elog.c:1459
> #4 0x00000000004e171b in AbortTransaction () at xact.c:2907
> #5 0x00000000004e2c45 in AbortCurrentTransaction () at xact.c:3377
> #6 0x00000000007b01ea in PostgresMain (argc=37474312, argv=0x0, username=<value optimized out>) at postgres.c:4507
> #7 0x0000000000763c03 in BackendRun (port=0x2373210) at postmaster.c:5889
> #8 BackendStartup (port=0x2373210) at postmaster.c:5484
> #9 0x000000000076435d in ServerLoop () at postmaster.c:2163
> #10 0x000000000076618e in PostmasterMain (argc=9, argv=0x236a5b0) at postmaster.c:1454
> #11 0x00000000006c028a in main (argc=9, argv=0x236a570) at main.c:226
> (gdb) f 3
> #3 0x00000000008727bb in elog_finish (elevel=<value optimized out>, fmt=<value optimized out>) at elog.c:1459
> (gdb) p *edata
> $1 = {elevel = 22, output_to_server = 1 '\001', output_to_client = 1 '\001',
> show_funcname = 0 '\000', omit_location = 0 '\000', fatal_return = 0 '\000',
> hide_stmt = 0 '\000', send_alert = 1 '\001', filename = 0x9cc38e "xact.c",
> lineno = 2907, funcname = 0x9c66c0 "AbortTransaction",
> domain = 0xafb668 "postgres-8.2", sqlerrcode = 2600,
> message = 0x236da50 "Resume interrupt holdoff count is bad (0) (xact.c:2907)",
> detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, cursorpos = 0,
> internalpos = 0, internalquery = 0x0, saved_errno = 5,
> stacktracearray = {0x871f7f, 0x872659, 0x4e171b, 0x4e2c45, 0x7b01ea, 0x763c03,
> 0x76435d, 0x76618e, 0x6c028a, 0x33d401ed1d, 0x4a17e9, 0x0 <repeats 19 times>},
> stacktracesize = 11, printstack = 0 '\000'}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)