On Mon, 6 Jan 2020 at 17:38, Israel Brewster <ijbrews...@alaska.edu> wrote:

>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
assuming CPU-bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>

It sounds like you're in a decent place on this, and that you have done an
appropriate amount of exploration of the matter.

I was pleased to hear that you have the idempotency of the updates well in
hand, and that the application can cope with the degree to which things
will temporarily be out of sync.

The estimate of 10h to update the data doesn't surprise me; that's long
enough that it sure seems tempting to do the work in pieces so that you
don't have your whole set of application data locked for 10h.
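As a sketch of what doing the work in pieces can look like (the table
name `readings`, the integer key `id`, and the `SET` expression here are
hypothetical stand-ins, not your actual schema), driving the update by
key range lets each batch commit on its own, so a failure partway
through costs you only the current batch rather than the whole 10h:

```sql
-- Hypothetical schema: readings(id integer PRIMARY KEY, value numeric).
-- Each statement runs in its own transaction, so locks are held only
-- for the duration of one batch and progress survives an error.
UPDATE readings SET value = value * 2 WHERE id BETWEEN 1      AND 100000;
UPDATE readings SET value = value * 2 WHERE id BETWEEN 100001 AND 200000;
-- ... continue in 100k-row steps up to max(id)
```

Batch size is a tuning knob: bigger batches mean fewer transactions but
longer lock hold times per batch.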

I'd be inclined to call this "enough attention" for a one-off event.

I'll poke at the trigger aspect a wee bit; if the trigger function does a
one-tuple-at-a-time handling of things, so that it fires 50M times, you
might get a substantial speedup by replacing that with an equivalent set
operation that processes a few thousand tuples at a time.  That said, if
you're happy with the process running 10h, it's probably not worth taking
on the extra testing needed to ensure identical end states.
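To make the trigger suggestion concrete: in PostgreSQL 10 and later, a
statement-level trigger with transition tables sees all the rows a
statement changed in one invocation, instead of firing once per row. A
minimal sketch, assuming a hypothetical `readings` table and `audit_log`
destination (both invented for illustration):

```sql
-- Fires once per UPDATE statement; OLD/NEW transition tables expose
-- every affected row, so 50M row changes become one set operation
-- per batch instead of 50M trigger invocations.
CREATE OR REPLACE FUNCTION audit_batch() RETURNS trigger AS $$
BEGIN
  INSERT INTO audit_log (row_id, old_value, new_value, changed_at)
  SELECT o.id, o.value, n.value, now()
  FROM old_rows o JOIN new_rows n USING (id);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER audit_after_update
  AFTER UPDATE ON readings
  REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE FUNCTION audit_batch();
```

The verification cost mentioned above is checking that this produces the
same end state as the row-at-a-time version for every edge case the old
trigger handled.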
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
