[
https://issues.apache.org/jira/browse/HAWQ-166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lei Chang updated HAWQ-166:
---------------------------
Affects Version/s: 2.0.0-beta-incubating
> Orphaned share input scan QE process after OLAP grouping query
> --------------------------------------------------------------
>
> Key: HAWQ-166
> URL: https://issues.apache.org/jira/browse/HAWQ-166
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0-beta-incubating
> Reporter: Ruilong Huo
> Assignee: Ruilong Huo
> Fix For: 2.0.0
>
> Attachments: olap_setup.sql
>
>
> Some OLAP grouping query may error out with "division by zero", and then
> leaves orphaned share input scan QE process on segment hosts. It causes HAWQ
> fails to shutdown, restart, etc.
> {code}
> [gpadmin@test2 ~]$ ps -ef | grep postgres
> gpadmin 115703 1 0 Nov15 ? 00:02:32
> /data/pulse-agent-data/HAWQ-main-FeatureTest-dbg-mutilnode-huor/product/hawq-2.0.0.0/bin/postgres
> -D
> /data/pulse-agent-data/HAWQ-main-FeatureTest-dbg-mutilnode-huor/product/segmentdd
> -i -M segment -p 21100 --silent-mode=true
> gpadmin 115704 115703 0 Nov15 ? 00:06:10 postgres: port 21100,
> logger process
> gpadmin 115707 115703 0 Nov15 ? 00:00:01 postgres: port 21100, stats
> collector process
> gpadmin 115708 115703 0 Nov15 ? 00:00:17 postgres: port 21100,
> writer process
> gpadmin 115709 115703 0 Nov15 ? 00:00:03 postgres: port 21100,
> checkpoint process
> gpadmin 115710 115703 0 Nov15 ? 00:00:38 postgres: port 21100,
> segment resource manager
> gpadmin 118013 115703 0 Nov15 ? 00:00:28 postgres: port 21100,
> gpadmin olap_group 10.32.35.205(46459) con58 seg1 cmd2 slice10 MPPEXEC SELECT
> gpadmin 118611 115703 0 Nov15 ? 00:00:28 postgres: port 21100,
> gpadmin olap_group 10.32.35.205(47395) con63 seg5 cmd2 slice36 MPPEXEC SELECT
> gpadmin 173633 115703 0 Nov15 ? 00:00:28 postgres: port 21100,
> gpadmin olap_group 10.32.35.205(3398) con616 seg3 cmd2 slice11 MPPEXEC SELECT
> gpadmin 180033 115703 0 Nov15 ? 00:00:28 postgres: port 21100,
> gpadmin olap_group 10.32.35.205(13067) con676 seg1 cmd2 slice16 MPPEXEC SELECT
> gpadmin 223009 115703 0 Nov15 ? 00:00:27 postgres: port 21100,
> gpadmin olap_group 10.32.35.205(13860) con1098 seg7 cmd2 slice21 MPPEXEC
> SELECT
> gpadmin 414471 414449 0 17:46 pts/1 00:00:00 grep postgres
> [gpadmin@test2 ~]$ sudo gdb -p 173633
> (gdb) bt
> #0 0x0000003c1e8e15e3 in select () from /lib64/libc.so.6
> #1 0x000000000076724d in shareinput_writer_waitdone (ctxt=0x2b6cc10,
> share_id=0, nsharer_xslice=4) at nodeShareInputScan.c:1011
> #2 0x000000000075b241 in ExecEndMaterial (node=0x2a98700) at
> nodeMaterial.c:496
> #3 0x00000000007250b0 in ExecEndNode (node=0x2a98700) at execProcnode.c:1690
> #4 0x000000000076539c in ExecEndShareInputScan (node=0x2a98250) at
> nodeShareInputScan.c:367
> #5 0x000000000072506a in ExecEndNode (node=0x2a98250) at execProcnode.c:1681
> #6 0x00000000007684e0 in ExecEndSort (node=0x2a97c20) at nodeSort.c:588
> #7 0x00000000007250f6 in ExecEndNode (node=0x2a97c20) at execProcnode.c:1696
> #8 0x000000000074b0c1 in ExecEndAgg (node=0x2a61df8) at nodeAgg.c:2644
> #9 0x0000000000725107 in ExecEndNode (node=0x2a61df8) at execProcnode.c:1700
> #10 0x00000000007639c1 in ExecEndRepeat (node=0x2a608e0) at nodeRepeat.c:218
> #11 0x000000000072517e in ExecEndNode (node=0x2a608e0) at execProcnode.c:1728
> #12 0x0000000000745ad5 in ExecEndAppend (node=0x2a603b8) at nodeAppend.c:411
> #13 0x0000000000724ec9 in ExecEndNode (node=0x2a603b8) at execProcnode.c:1582
> #14 0x0000000000745ad5 in ExecEndAppend (node=0x2a5ff08) at nodeAppend.c:411
> #15 0x0000000000724ec9 in ExecEndNode (node=0x2a5ff08) at execProcnode.c:1582
> #16 0x0000000000760997 in ExecEndMotion (node=0x2a79950) at nodeMotion.c:1232
> #17 0x000000000072516d in ExecEndNode (node=0x2a79950) at execProcnode.c:1724
> #18 0x000000000071c9e9 in ExecEndPlan (planstate=0x2a79950, estate=0x29d4318)
> at execMain.c:2891
> #19 0x0000000000719952 in ExecutorEnd (queryDesc=0x29cc208) at execMain.c:1365
> #20 0x00000000006b452b in PortalCleanupHelper (portal=0x29e2528) at
> portalcmds.c:347
> #21 0x00000000006b4454 in PortalCleanup (portal=0x29e2528) at portalcmds.c:299
> #22 0x0000000000a1c8ca in AtAbort_Portals () at portalmem.c:677
> #23 0x0000000000521b9b in AbortTransaction () at xact.c:2819
> #24 0x0000000000522753 in AbortCurrentTransaction () at xact.c:3332
> #25 0x00000000008fc43d in PostgresMain (argc=276, argv=0x2912038,
> username=0x28f57f8 "gpadmin") at postgres.c:4469
> #26 0x000000000089f48a in BackendRun (port=0x28bde30) at postmaster.c:5844
> #27 0x000000000089e914 in BackendStartup (port=0x28bde30) at postmaster.c:5437
> #28 0x0000000000899032 in ServerLoop () at postmaster.c:2139
> #29 0x0000000000898108 in PostmasterMain (argc=9, argv=0x28a6130) at
> postmaster.c:1431
> #30 0x00000000007b199a in main (argc=9, argv=0x28a6130) at main.c:226
> {code}
> The reproduction steps are:
> {code}
> Step 1: Prepare schema and data using attached olap_setup.sql
> Step 2: Run below OLAP grouping query
> -- OLAP query involving MAX() function
> SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn),
> TO_CHAR(COALESCE(MAX(DISTINCT
> floor(sale.vn+sale.qty)),0),'99999999.9999999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'99999999.9999999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'99999999.9999999')
> FROM sale,customer,vendor
> WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> GROUP BY
> ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn
> HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)