[ 
https://issues.apache.org/jira/browse/HAWQ-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruilong Huo updated HAWQ-1371:
------------------------------
    Fix Version/s:     (was: backlog)
                   2.2.0.0-incubating

> QE process hang in shared input scan
> ------------------------------------
>
>                 Key: HAWQ-1371
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1371
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.1.0.0-incubating
>            Reporter: Amy
>            Assignee: Amy
>             Fix For: 2.2.0.0-incubating
>
>
> A QE process hangs on one segment node while the QD and the QEs on the 
> other segment nodes have already terminated.
> {code}
> on segment test2:
> [gpadmin@test2 ~]$ pp
> gpadmin   21614  0.0  1.2 788636 407428 ?       Ss   Feb26   1:19 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-YARN/product/segmentdd -p 31100 --silent-mode=true -M segment -i
> gpadmin   21615  0.0  0.0 279896  6952 ?        Ss   Feb26   0:08 postgres: port 31100, logger process
> gpadmin   21618  0.0  0.0 282128  6980 ?        Ss   Feb26   0:00 postgres: port 31100, stats collector process
> gpadmin   21619  0.0  0.0 788636  7280 ?        Ss   Feb26   0:11 postgres: port 31100, writer process
> gpadmin   21620  0.0  0.0 788636  7064 ?        Ss   Feb26   0:01 postgres: port 31100, checkpoint process
> gpadmin   21621  0.0  0.0 793048 11752 ?        S    Feb26   0:19 postgres: port 31100, segment resource manager
> gpadmin   91760  0.0  0.0 861000 16840 ?        TNsl Feb26   0:07 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15250) con558 seg4 cmd2 slice11 MPPEXEC SELECT
> gpadmin   91762  0.0  0.0 861064 17116 ?        SNsl Feb26   0:08 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15253) con558 seg5 cmd2 slice11 MPPEXEC SELECT
> gpadmin  216648  0.0  0.0 103244   788 pts/0    S+   19:54   0:00 grep postgres
> {code}
> The stack trace of the hung QE is:
> {code}
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
> #2  0x0000000000695798 in ExecEndMaterial (node=0x1d2eb50) at nodeMaterial.c:512
> #3  0x000000000067048d in ExecEndNode (node=0x1d2eb50) at execProcnode.c:1681
> #4  0x000000000069c6b5 in ExecEndShareInputScan (node=0x1d2e6f0) at nodeShareInputScan.c:382
> #5  0x000000000067042a in ExecEndNode (node=0x1d2e6f0) at execProcnode.c:1674
> #6  0x00000000006ac9be in ExecEndSequence (node=0x1d23890) at nodeSequence.c:165
> #7  0x00000000006705f0 in ExecEndNode (node=0x1d23890) at execProcnode.c:1583
> #8  0x000000000069a0ab in ExecEndResult (node=0x1d214a0) at nodeResult.c:481
> #9  0x000000000067060d in ExecEndNode (node=0x1d214a0) at execProcnode.c:1575
> #10 0x000000000069a0ab in ExecEndResult (node=0x1d20860) at nodeResult.c:481
> #11 0x000000000067060d in ExecEndNode (node=0x1d20860) at execProcnode.c:1575
> #12 0x0000000000698fd2 in ExecEndMotion (node=0x1d20320) at nodeMotion.c:1230
> #13 0x0000000000670434 in ExecEndNode (node=0x1d20320) at execProcnode.c:1713
> #14 0x0000000000669da7 in ExecEndPlan (planstate=0x1d20320, estate=0x1cb6b40) at execMain.c:2896
> #15 0x000000000066a311 in ExecutorEnd (queryDesc=0x1cabf20) at execMain.c:1407
> #16 0x00000000006195f2 in PortalCleanupHelper (portal=0x1cbcc40) at portalcmds.c:365
> #17 PortalCleanup (portal=0x1cbcc40) at portalcmds.c:317
> #18 0x0000000000900544 in AtAbort_Portals () at portalmem.c:693
> #19 0x00000000004e697f in AbortTransaction () at xact.c:2800
> #20 0x00000000004e7565 in AbortCurrentTransaction () at xact.c:3377
> #21 0x00000000007ed0fa in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0x1b47f10 "gpadmin") at postgres.c:4630
> #22 0x00000000007a05d0 in BackendRun () at postmaster.c:5915
> #23 BackendStartup () at postmaster.c:5484
> #24 ServerLoop () at postmaster.c:2163
> #25 0x00000000007a3399 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3
> ) at postmaster.c:1454
> #26 0x00000000004a52e9 in main (argc=9, argv=0x1b0cd10) at main.c:226
> (gdb) p CurrentTransactionState->state
> $1 = TRANS_ABORT
> (gdb) p pctxt->donefd
> No symbol "pctxt" in current context.
> (gdb) f 1
> #1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
> 989           nodeShareInputScan.c: No such file or directory.
>               in nodeShareInputScan.c
> (gdb) p pctxt->donefd
> $2 = 15
> {code}
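
The backtrace above shows the QE blocked in select() inside shareinput_writer_waitdone while the transaction state is TRANS_ABORT. Purely as an illustration of that pattern, and not as HAWQ code or the fix for this issue, the sketch below shows one generic way to keep such a wait from blocking forever: call select() with a timeout and re-check a cancel flag on every wakeup. The function name wait_done_or_cancel, the cancel flag, the 100 ms interval, and the one-byte-per-notification protocol are all assumptions made for the example.

```c
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from HAWQ): wait until `expected` one-byte
 * notifications have arrived on donefd, or give up as soon as *cancel is set.
 * Returns the number of notifications received, or -1 on cancel, EOF or error.
 */
static int
wait_done_or_cancel(int donefd, int expected, volatile sig_atomic_t *cancel)
{
    int received = 0;

    while (received < expected)
    {
        fd_set          rset;
        struct timeval  tv;
        int             n;

        if (*cancel)
            return -1;              /* caller is aborting: stop waiting */

        FD_ZERO(&rset);
        FD_SET(donefd, &rset);
        tv.tv_sec = 0;
        tv.tv_usec = 100 * 1000;    /* assumed interval: wake every 100 ms */

        n = select(donefd + 1, &rset, NULL, NULL, &tv);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 1)
        {
            char    b;
            ssize_t r = read(donefd, &b, 1);

            if (r == 1)
                received++;         /* one reader reported completion */
            else if (r == 0)
                return -1;          /* every writer closed: nobody left to notify */
            else if (errno != EINTR && errno != EAGAIN)
                return -1;
        }
        /* n == 0 means the timeout expired; loop and re-check the flag */
    }
    return received;
}
```

With a bounded wait like this, a process that is already aborting can leave the loop even if the peers that were supposed to send the notification have exited. Whether that matches the actual cause of HAWQ-1371 is not stated in this message.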



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
