On Fri, Jan 5, 2018 at 5:00 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> The early returns indicate that that problem is fixed;

Thanks for your help and patience with that. I've made a list over
here so we don't lose track of the various things that should be
improved in this area, and will start a new thread when I have patches
to propose:

https://wiki.postgresql.org/wiki/Parallel_Hash

> but now that the
> noise level is down, it's possible to see that brolga is showing an actual
> crash in the PHJ test, perhaps one time in four. So we're not out of
> the woods yet. It seems to consistently look like this:
>
> 2017-12-21 17:34:52.092 EST [2252:4] LOG: background worker "parallel
> worker" (PID 3584) was terminated by signal 11
> 2017-12-21 17:34:52.092 EST [2252:5] DETAIL: Failed process was running:
> select count(*) from foo
> left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
> on foo.id < ss.id + 1 and foo.id > ss.id - 1;
> 2017-12-21 17:34:52.092 EST [2252:6] LOG: terminating any other active
> server processes

That is a test of a parallel-aware hash join with a rescan (ie workers
get restarted repeatedly by the gather node reusing the DSM; maybe I
misunderstood some detail of the protocol for that). I'll go and
review that code and try to reproduce the failure. On the off-chance,
Andrew, is there any chance you have a core dump you could pull a
backtrace out of, on brolga?

--
Thomas Munro
http://www.enterprisedb.com