Re: Logical replication, need to reclaim big disk space

Achilleas Mantzios Mon, 19 May 2025 11:50:21 -0700

On 19/5/25 17:38, Moreno Andreo wrote:

On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are changing our old binary data approach, moving the data from bytea fields in a table to external storage (making the database smaller and related operations faster and smarter). In short, we have a job that runs in the background, copies data from the table to an external file, and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, in a table that's less than one tenth in size. We have a multi-tenant architecture (100s of schemas with identical architecture, all inheriting from public) and we are performing the task on one table per schema.
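A minimal sketch of the offload step described above (copy the blob to a file, then clear the column and store the path). The table and column names (tbl, blob, ref, id) come from the UPDATE statement in the post; everything else is an assumption. In particular, sqlite3 is used only as a stand-in so the sketch runs without a PostgreSQL server, the helper name offload_blob is hypothetical, and a local directory replaces the external storage. With a PostgreSQL driver such as psycopg the placeholder would be %s instead of ?.

```python
import sqlite3
from pathlib import Path


def offload_blob(conn, row_id, dest_dir):
    """Copy one row's blob to a file, then set the column to NULL.

    Hypothetical helper: returns the file path, or None if there was
    nothing to offload (missing row or blob already cleared).
    """
    row = conn.execute("SELECT blob FROM tbl WHERE id = ?", (row_id,)).fetchone()
    if row is None or row[0] is None:
        return None

    path = Path(dest_dir) / f"{row_id}.bin"
    path.write_bytes(row[0])

    # Re-read the file before discarding the database copy.
    if path.read_bytes() != row[0]:
        raise IOError(f"verification failed for {path}")

    # One row, one transaction: the connection context manager commits
    # on success and rolls back if the UPDATE raises.
    with conn:
        conn.execute(
            "UPDATE tbl SET blob = NULL, ref = ? WHERE id = ?",
            (str(path), row_id),
        )
    return path
```

Calling offload_blob a second time for the same id returns None, so a job that is interrupted and restarted can simply skip rows it has already processed.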
So? Toasted data are kept in separate TOAST tables; unless those bytea cols are selected, you won't even touch them. I cannot understand what you are trying to achieve here.
Years ago, when I made the mistake of going for a coffee and letting my developers "improvise", the result was a design similar to what you are trying to achieve. Years later, I am seriously considering moving those data back to PostgreSQL.
The "related operations" I was talking about are backups and database maintenance when needed, cluster/replica management, etc. With a smaller database size they would be easier in timing and effort, right?
Ok, but you'll lose replica functionality for those blobs, which means you don't care about them, correct me if I am wrong.
I'm not saying I don't care about them, quite the opposite: they are protected with Object Versioning and soft deletion, which should ensure good protection against e.g. ransomware, if someone manages to get in there (and if this happens, we'll have bigger troubles than this).

PostgreSQL has become very popular because of ppl who care about their data.

We are mostly talking about costs here. To call things by their names, I'm moving bytea contents (85% of total data) to files in Google Cloud Storage buckets, which cost a fraction of the disks holding my database (on GCE, to be clear).
May I ask the size of the bytea data (uncompressed)?
Single records vary from 150k to 80 MB; the grand total is more than 8.5 TB in a circa 10 TB data footprint.
This data is not accessed frequently (just by the owner when he needs to), so there is no need to keep it on expensive hardware. I've read over the years that keeping many big bytea fields in databases is not recommended, but I might have misunderstood this.
Ok, I assume those are unimportant data, but let me ask, what is the longevity or expected legitimacy of those? I haven't worked with those, just reading:
https://cloud.google.com/storage/pricing?_gl=1*1b25r8o*_up*MQ..&gclid=CjwKCAjwravBBhBjEiwAIr30VKfaOJytxmk7J29vjG4rBBkk2EUimPU5zPibST73nm3XRL2h0O9SxRoCaogQAvD_BwE&gclsrc=aw.ds#storage-pricing

would you choose e.g. "Anywhere Cache storage"?
Absolutely not, this is *not* unimportant data, and we are using Standard Storage, at 0.02 $/GB/month + operations, which, compared to 0.17 $/GB/month for an SSD, or even more for the Hyperdisks we are using, is a good price drop.
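To put the quoted prices in perspective, here is a rough back-of-the-envelope calculation using only figures from this thread (8.5 TB of bytea data, 0.02 vs 0.17 $/GB/month). It assumes decimal units (1 TB = 1000 GB) and ignores the per-operation charges and the higher Hyperdisk price, so it is an estimate rather than a real bill.

```python
GB_PER_TB = 1000  # assumption: decimal units


def monthly_cost(size_tb, price_per_gb_month):
    """Storage-only monthly cost in dollars (operation charges not included)."""
    return size_tb * GB_PER_TB * price_per_gb_month


BLOB_TB = 8.5        # "more than 8.5 TB", so this is a lower bound
BUCKET_PRICE = 0.02  # Standard Storage, $/GB/month, as quoted above
SSD_PRICE = 0.17     # SSD, $/GB/month, as quoted above

bucket = monthly_cost(BLOB_TB, BUCKET_PRICE)
ssd = monthly_cost(BLOB_TB, SSD_PRICE)

print(f"buckets: {bucket:.0f} $/month")   # buckets: 170 $/month
print(f"ssd:     {ssd:.0f} $/month")      # ssd:     1445 $/month
print(f"ratio:   {ssd / bucket:.1f}x")    # ratio:   8.5x
```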

How about hosting your data in your own storage and spending 0$/GB/month?

Another way would have been to move these tables to a different tablespace, on cheaper storage, but it still would have been 3 times the buckets' cost.
Can you actually mount those Cloud Storage buckets under a supported FS in Linux and just move them to tablespaces backed by this storage?
Never tried. I mounted this via FUSE and did some simple operations in the past, but I'm not sure it can handle database operations in terms of I/O bandwidth.
Why are you considering getting data back into database tables?
Because now if we need to migrate from cloud to on-premise, or just upgrade or move the specific server which holds those data, I will have an extra headache. Also this is a single point of failure, or at best a cause of fragmented technology introduced just for the sake of keeping things out of the DB.
This is managed as a hierarchical disk structure, so the calling server may be literally anywhere; it just needs an account (or a service account) to get in there,

and you are locked into a proprietary solution, and at their mercy for any future increases in cost.

The problem is: this is generating BIG table bloat, as you may imagine. Running a VACUUM FULL on an ex-22GB table on a standalone test server is almost immediate. If I had only one server, I'd process one table at a time, with a nightly script, and issue a VACUUM FULL on tables that have already been processed.
But I'm in a logical replication architecture (we are using a multimaster system called pgEdge, but I don't think it will make a big difference, since it's based on logical replication), and I'm building a test cluster.
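The nightly "one table at a time" idea above could be sketched as a small statement generator, one VACUUM FULL per tenant schema. The table name tbl is taken from the UPDATE statement earlier in the thread; the helper names and the schema names in the usage note are hypothetical. The sketch only builds the SQL text. It does not run anything, and it says nothing about whether VACUUM FULL is safe under pgEdge or any other logical replication setup, which is exactly the open question here.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def vacuum_full_statements(schemas, table="tbl"):
    """Build one VACUUM FULL statement per schema, in the given order."""
    return [
        f"VACUUM FULL {quote_ident(schema)}.{quote_ident(table)};"
        for schema in schemas
    ]
```

For example, vacuum_full_statements(["tenant_a", "tenant_b"]) yields two statements that a nightly script could run one after another, restricted to the schemas whose table has already been processed.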
So you use pgEdge, but you want to lose all the benefits of multi-master, since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be "edited", so either it's there or it's been deleted. Buckets have protections for data deletions or events like ransomware attacks and such. Also multi-master was an absolute requirement one year ago because of a project we were building, but it has been abandoned and now a simple logical replication would be enough, but let's do one thing at a time.
Multi-master is cool, you can configure your pooler / clients to take advantage of it for a fully load-balanced architecture, but if it's not a strict requirement, you can live without it, as so many of us do, and employ other means of load balancing the reads.
That's what we are doing, it's a really cool feature, but I experienced (maybe because it uses the old pglogical extension) that the replication is a bit fragile, especially when dealing with those bytea fields: when I ingest big loads, say 25-30 GB or more, it has happened to break replication. Recreating a replica from scratch with "normal size" tables is not a big deal, since it can be achieved automatically, because they normally fit in shared memory and can be transferred by the replicator, but you can imagine the effort and the downtime necessary to create a base backup, transfer it to the replica, build the DB and restart a 10-TB database (ATM we are running with a 2-node cluster).

Break this into batches, use modern techniques for robust data loading, in smaller transactions, if you have to.
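As a rough illustration of the batching advice, the sketch below splits any iterable of rows into fixed-size batches and hands each batch to a callback. The names in_batches, load_in_batches and load_batch are hypothetical. The assumption is that the callback inserts one batch and commits, so each transaction stays small; the right batch size for a given replication setup is not something this thread establishes.

```python
from itertools import islice


def in_batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(rows, size, load_batch):
    """Call load_batch(batch) once per batch and return the number of batches.

    load_batch is a hypothetical callback, assumed to insert the batch
    and commit, so that every batch is its own small transaction.
    """
    count = 0
    for batch in in_batches(rows, size):
        load_batch(batch)
        count += 1
    return count
```

Because in_batches consumes the input lazily, the full load never has to sit in memory at once, which matters when single records can reach tens of megabytes.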

I've been instructed to issue VACUUM FULL on both nodes, nightly, but before proceeding I read in the docs that VACUUM FULL can disrupt logical replication, so I'm a bit concerned about how to proceed. Rows are cleared one at a time (one transaction, one row, to confine errors to the record that caused them).
Mind sharing the specific doc?
Obviously I can't find it from a quick search, I'll search deeper, I don't think it came from a dream :-).
pgEdge is based on the old pglogical, the old 2ndQuadrant extension, not the native logical replication we have had since pgsql 10. But I might be mistaken.
Don't know about this, it keeps running on the latest pg versions (we are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
I read about extensions like pg_squeeze, but I wonder whether they are also dangerous for replication.
What's pgEdge's take on that, I mean the bytea thing you are trying to achieve here?
They are positive, it was they who suggested doing VACUUM FULL on both nodes... I'm quite new to replication, so I'm looking for some advice here.
As I told you, pgEdge logical replication (old 2ndQuadrant BDR) != native logical replication. You may look here:
https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla PostgreSQL and focus on standard physical and logical replication.
No, multimaster is cool, but as I said, the project has been discontinued and it's not a must anymore. This is the first step, actually. We are planning to return to plain PostgreSQL, or CloudSQL for PostgreSQL, using logical replication (that seems the most reliable of the two). We created a test case for both options, and they seem to be OK for now, even if I still have to do adequate stress tests. And when I do the migration, I'd like to migrate plain data only and leave blobs where they are.

As you wish. But this design has inherent data infra fragmentation, as you understand.

Personally I like to let the DB take care of the data, and I take care of the DB, not a plethora of extra systems that we need to keep connected and consistent.

Thanks for your help.
Moreno.-
