Hi All,

Recently, while running TPC-H queries on the PostgreSQL master branch, I noticed random server crashes. Most of the time the crash occurs while running TPC-H query number 3, but it is very random.
Call stack of the server crash:

(gdb) bt
#0  0x00000000102aa9ac in ExplainNode (planstate=0x1001a7471d8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1516
#1  0x00000000102aaeb8 in ExplainNode (planstate=0x1001a746b50, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#2  0x00000000102aaeb8 in ExplainNode (planstate=0x1001a7468b8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#3  0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745e48, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#4  0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745bb0, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#5  0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745870, ancestors=0x1001a745348, relationship=0x0, plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#6  0x00000000102a81d8 in ExplainPrintPlan (es=0x1001a65ad48, queryDesc=0x1001a7442a0) at explain.c:599
#7  0x00000000102a7e20 in ExplainOnePlan (plannedstmt=0x1001a744208, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, planduration=0x3fffe17c6488) at explain.c:515
#8  0x00000000102a7968 in ExplainOneQuery (query=0x1001a65ab98, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0) at explain.c:357
#9  0x00000000102a74b8 in ExplainQuery (stmt=0x1001a6fe468, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, dest=0x1001a65acb0) at explain.c:245

(gdb) p *w
$2 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p planstate
$3 = (PlanState *) 0x1001a7471d8
(gdb) p *planstate
$4 = {type = T_HashJoinState, plan = 0x1001a740730, state = 0x1001a745758, instrument = 0x1001a7514a8, worker_instrument = 0x1001af8a8c8, targetlist = 0x1001a747448, qual = 0x0, lefttree = 0x1001a74a0e8, righttree = 0x1001a74f2f8, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1001a750c10, ps_ExprContext = 0x1001a7472f0, ps_ProjInfo = 0x1001a751220, ps_TupFromTlist = 0 '\000'}
(gdb) p *planstate->worker_instrument
$5 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p n
$6 = 13055
(gdb) p n
$7 = 13055

Here it is clear that worker_instrument is either corrupted or uninitialized, which is why the server crashes.

With a bit more debugging and a look at the git history, I found that the issue started with commit af33039317ddc4a0e38a02e2255c2bf453115fd2.

gather_readnext() calls ExecShutdownGatherWorkers() when nreaders == 0. ExecShutdownGatherWorkers() calls ExecParallelFinish(), which collects the instrumentation before marking the ParallelExecutorInfo as finished. ExecParallelRetrieveInstrumentation() allocates planstate->worker_instrument.

With commit af33039317 we now call gather_readnext() in the per-tuple context. So when nreaders == 0 and ExecShutdownGatherWorkers() runs, planstate->worker_instrument ends up allocated in the per-tuple context, which is wrong: that context is reset between tuples, so the pointer is left dangling by the time EXPLAIN reads it.

Possible fixes:

1) Avoid calling ExecShutdownGatherWorkers() from gather_readnext() and let ExecEndGather() do that.
But with this change, gather_readnext() and gather_getnext() depend on the planstate->reader structure to continue reading tuples. So either we change those conditions to depend on planstate->nreaders, or we just pfree(planstate->reader) in gather_readnext() instead of calling ExecShutdownGatherWorkers().

2) Teach ExecParallelRetrieveInstrumentation() to allocate planstate->worker_instrument in the proper memory context.

Attaching a patch that fixes the issue with approach 1).

Regards,
Rushabh Lathia
www.EnterpriseDB.com
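For illustration only, here is a toy model of the failure described above. It is not PostgreSQL code: ToyContext, toy_alloc(), toy_reset() and ToyWorkerInstrumentation are hypothetical names made up for this sketch, and the 0x7F fill on reset is an assumption chosen because num_workers = 2139062143 is 0x7F7F7F7F, which is what a freed-memory fill of 0x7F bytes would leave behind (PostgreSQL does this in builds that clobber freed memory; whether the crashing build had that enabled is not stated above). The sketch shows why a struct allocated in a short-lived context reads as garbage after that context is reset, and why allocating it in a longer-lived context, as in approach 2), avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a memory context: one malloc'd buffer plus a bump pointer. */
typedef struct ToyContext
{
	unsigned char *base;
	size_t		size;
	size_t		used;
} ToyContext;

/* Hypothetical mirror of the two fields shown in the gdb output. */
typedef struct ToyWorkerInstrumentation
{
	int			num_workers;
	void	   *instrument;
} ToyWorkerInstrumentation;

static void
toy_init(ToyContext *cxt, size_t size)
{
	cxt->base = malloc(size);
	assert(cxt->base != NULL);
	cxt->size = size;
	cxt->used = 0;
}

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
	/* Round each chunk up to 16 bytes so every chunk stays aligned. */
	size_t		aligned = (size + 15u) & ~(size_t) 15u;
	void	   *p;

	assert(cxt->used + aligned <= cxt->size);
	p = cxt->base + cxt->used;
	cxt->used += aligned;
	return p;
}

/* Reset: forget all chunks and overwrite them with 0x7F bytes (assumed pattern). */
static void
toy_reset(ToyContext *cxt)
{
	memset(cxt->base, 0x7F, cxt->size);
	cxt->used = 0;
}

/*
 * Allocate the struct in either the per-tuple or the per-query toy context,
 * reset only the per-tuple context, and return what num_workers reads as.
 */
static int
num_workers_after_reset(int use_per_tuple)
{
	ToyContext	per_tuple;
	ToyContext	per_query;
	ToyWorkerInstrumentation *w;
	int			seen;

	toy_init(&per_tuple, 256);
	toy_init(&per_query, 256);

	w = toy_alloc(use_per_tuple ? &per_tuple : &per_query, sizeof(*w));
	w->num_workers = 2;
	w->instrument = NULL;

	toy_reset(&per_tuple);		/* in the toy model, done once per tuple */
	seen = w->num_workers;		/* stale read if w lived in per_tuple */

	free(per_tuple.base);
	free(per_query.base);
	return seen;
}
```

On a platform with 32-bit int, num_workers_after_reset(1) returns 2139062143, the same value gdb printed, while num_workers_after_reset(0) returns 2 because the per-query buffer is untouched by the per-tuple reset.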
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 438d1b2..bc56f0e 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -348,7 +348,8 @@ gather_readnext(GatherState *gatherstate)
 			--gatherstate->nreaders;
 			if (gatherstate->nreaders == 0)
 			{
-				ExecShutdownGatherWorkers(gatherstate);
+				pfree(gatherstate->reader);
+				gatherstate->reader = NULL;
 				return NULL;
 			}
 			memmove(&gatherstate->reader[gatherstate->nextreader],
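As a rough sketch of what the patch above relies on, the following toy model mimics the bookkeeping around the reader array. ToyGatherState, toy_drop_current_reader() and toy_has_readers() are hypothetical names, the readers are plain ints rather than tuple queue readers, and malloc/free stand in for palloc/pfree. The point it illustrates is only this: when the last reader goes away, the array is freed and the pointer set to NULL, so callers that test the reader pointer stop reading without any worker shutdown (and hence any instrumentation allocation) happening at that moment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the reader bookkeeping in a Gather node. */
typedef struct ToyGatherState
{
	int			nreaders;		/* number of live readers */
	int			nextreader;		/* index of the reader to poll next */
	int		   *reader;			/* stand-in for the array of readers */
} ToyGatherState;

/*
 * Drop the reader at nextreader because it is exhausted.
 * Returns 1 if other readers remain, 0 if that was the last one.
 */
static int
toy_drop_current_reader(ToyGatherState *g)
{
	assert(g->nreaders > 0 && g->reader != NULL);

	--g->nreaders;
	if (g->nreaders == 0)
	{
		/* Approach 1: release only the array; shutdown is left for later. */
		free(g->reader);
		g->reader = NULL;
		return 0;
	}

	/* Close the gap left by the dropped reader. */
	memmove(&g->reader[g->nextreader],
			&g->reader[g->nextreader + 1],
			sizeof(int) * (size_t) (g->nreaders - g->nextreader));
	if (g->nextreader >= g->nreaders)
		g->nextreader = 0;
	return 1;
}

/* Callers keep reading only while the reader array still exists. */
static int
toy_has_readers(const ToyGatherState *g)
{
	return g->reader != NULL;
}
```

Starting from readers {10, 20, 30} with nextreader = 1, the first call leaves {10, 30}, the second wraps nextreader back to 0, and the third frees the array, after which toy_has_readers() returns 0.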