On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrews...@alaska.edu> wrote:
> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbro...@gmail.com>
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrews...@alaska.edu>
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is
>> straightforward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update
>> (one bad value in row 45 million, lost connection, etc.), all the work
>> that has already been done will be lost due to the transactional nature
>> of such commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row,
>> I have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number
>> of logical segregations I can apply); I’m just wondering if there is a
>> more automatic option.
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches of, say, a few thousand tuples each:
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
>
> Then loop repeatedly along the lines of...
>
> create temp table iteration as select pk_column from just_keys where
>   processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
>   from iteration)]
> update iteration set processed='true' where pk_column in (select
>   pk_column from iteration);
> drop table iteration;
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread
> tuples across the 8 processes, so that they can grab their own tuples
> and not block one another.
>
> In that case, more like...
>
> create temp sequence seq_procs start with 1 maxvalue 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
>   processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys
> corresponding to their assigned batch number.
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be I/O bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching
> procedure makes sense. I assume you meant "update just_keys" in your
> sample rather than "update iteration" on the line just before "drop
> table iteration". Thanks for the info!

As for parallelism, if you have a really powerful disk subsystem (lots of
disks in a disk array), it may help. Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
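For what it's worth, here's an untested sketch of how the corrected loop
could be rolled into a single server-side block, with a commit after each
batch so a failure only costs the batch in flight. It assumes PostgreSQL 11
or later (COMMIT inside a DO block only works there, and only when the
block isn't run inside an outer BEGIN); big_historical_table, pk_column,
changed_field, and new_value() are the placeholder names already used
above:

do $$
declare
  rows_done integer;
begin
  loop
    -- Flag the next batch of keys and update the matching rows in one
    -- statement; if the update hits a bad value, the flagging rolls back
    -- with it, so no key gets marked processed without its row being done.
    with batch as (
      update just_keys
         set processed = true
       where pk_column in (select pk_column
                             from just_keys
                            where processed is null
                            limit 1000)
       returning pk_column
    )
    update big_historical_table t
       set changed_field = new_value()
      from batch
     where t.pk_column = batch.pk_column;

    get diagnostics rows_done = row_count;
    exit when rows_done = 0;   -- no keys left to process

    commit;  -- each finished batch is durable from here on
  end loop;
end
$$;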
You won't find anything terribly much more automatic than that.

Oh, yah, there's a possible further complication: does the application need
to be stopped to do this update? Is the newest version of the app still
generating data that needs the rewriting? Sure hope not...
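And if you do end up trying the parallel variant, here's a minimal sketch
of one worker, assuming the batch_id column from the cycling-sequence
version above; worker_id is a made-up psql variable supplied per process
with -v:

-- worker.sql, launched once per core, along the lines of:
--   psql -v worker_id=1 -f worker.sql &
--   psql -v worker_id=2 -f worker.sql &
--   ... up to worker_id=8
with batch as (
  update just_keys
     set processed = true
   where batch_id = :worker_id
     and not processed
   returning pk_column
)
update big_historical_table t
   set changed_field = new_value()
  from batch
 where t.pk_column = batch.pk_column;

Since the batch_id slices are disjoint, the workers never fight over the
same rows; each worker could still chunk its own slice with the
processed-flag loop above if per-transaction size matters.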