On 2016-04-25 12:38:36 -0400, Robert Haas wrote:
> I think that the point of my message is exactly what I said in my
> message.  This isn't really about the last couple of days.  The issue
> was reported on March 20th.  On March 31st, Noah asked you for a plan
> to get it fixed by April 7th.  You never replied.  On April 16th, the
> issue not having been fixed, he followed up.  You said that you would
> fix it next week.  That week is now over, and we're into the following
> week.

Well, I posted a patch. I'd have applied it too (after addressing your
comments, obviously), except that there are some interdependencies with the
nsmg > 0 thread (some of my tests fail spuriously without that
fixed). Early last week I waited for a patch on that thread, but when
that didn't materialize by Friday I switched to work on that [1].  With
both fixes applied I can't reproduce any problems anymore.

About the delay: Sure, it'd have been nicer if I'd addressed this
immediately. But priority-wise it's an issue that's damned hard to hit,
and where the worst known consequence is having to reconnect; that's not
earth-shattering. So spending time on the, imo more severe, performance
regressions seemed more important; maybe I was wrong in prioritizing
things that way.

> We have a patch, and that's good, and I have reviewed it and
> Thom has tested it, and that's good, too.  But it is not clear whether
> you feel confident to commit it or when you might be planning to do
> that, so I asked.  Given that this is the open item of longest tenure
> and that we're hoping to ship beta soon, why is that out of line?

Well, if you'd asked it that way, then I wouldn't have responded the way I did.

> We initially had a theory that the commit that caused this issue
> merely revealed an underlying problem that had existed before, but I
> no longer really think that's the case.

I do think there are some lingering problems (avoiding a FATAL by choosing
an index scan instead of a seqscan, the problem afaics can transiently
occur in HS, any corrupted index can trigger it ...) - but I agree it's
not really been a big problem so far.

> That commit introduced a new way to write to blocks that might have in
> the meantime been removed, and it failed to make that safe.

There's no writing of any blocks involved - the problem is just about
opening segments which might or might not exist.

> And in fact I think it's a regression that can be
> expected to be a significant operational problem for people if not
> fixed, because the circumstances in which it can happen are not very
> obscure.  You just need to hold some pending flush requests in your
> backend-local queue while some other process truncates the relation,
> and boom.

I found it quite hard to come up with scenarios to reproduce it. Entire
segments have to be dropped, the writes to the to-be-dropped segment
have to result in fully dead rows, only a few further writes outside the
dropped segment can happen, and invalidations must not fix the problem up.
But obviously it should be fixed.

> I assume you will be pretty darned unhappy if we end up at #2, so I am
> trying to figure out if we can achieve #1.  OK?




Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)