On Tue, Feb 04, 2020 at 12:53:44AM +0000, tsunakawa.ta...@fujitsu.com
wrote:
From: Tomas Vondra <tomas.von...@2ndquadrant.com>
That's not really what I meant - let me explain. When I said DROP
TABLE should do everything as usual, that includes catalog changes.
I.e. after the commit there would not be any remaining entries in
system catalogs or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode
from the queue.

So essentially others would not be able to even see the (now dropped)
object; they could create a new object with the same name, etc.

That sounds good.  I think we can also follow the way the WAL archiver
does its job, instead of using a regular table.  That is, when the
transaction that performed DROP TABLE commits, it puts the data files
in the "trash bin," which is actually a filesystem directory.  Or, it
just renames the data files in the original directory by appending some
suffix such as ".del".  Then, the background worker scans the trash bin
or the data directory to erase the file content and delete the file.


Yeah, that could work, I guess.
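For the record, the "move to the trash bin" step is essentially just a
rename plus an fsync of the affected directories. A rough POSIX sketch
(the pg_trash directory name and the path layout are made up here, and a
real patch would of course go through the fd.c routines):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* fsync a directory, so that a rename into/out of it survives a crash */
static int
fsync_dir(const char *path)
{
    int         ret;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    ret = fsync(fd);
    close(fd);
    return ret;
}

/*
 * Durably move one relation segment into the (hypothetical) trash
 * directory, e.g. move_to_trash("base/16384", "16385", "pg_trash")
 * moves base/16384/16385 to pg_trash/16385.del.
 */
int
move_to_trash(const char *dbdir, const char *segname, const char *trashdir)
{
    char        oldpath[PATH_MAX];
    char        newpath[PATH_MAX];

    snprintf(oldpath, sizeof(oldpath), "%s/%s", dbdir, segname);
    snprintf(newpath, sizeof(newpath), "%s/%s.del", trashdir, segname);

    if (rename(oldpath, newpath) < 0)
        return -1;

    /* the rename is not durable until both directories are fsynced */
    if (fsync_dir(trashdir) < 0 || fsync_dir(dbdir) < 0)
        return -1;

    return 0;
}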

The trash bin mechanism may open up the possibility of restoring
mistakenly dropped tables, a feature like Oracle's Flashback Drop.  The
dropping transaction puts the table metadata (system catalog data or
DDL) in the trash bin as well as the data file.


That seems like a very different feature, and I doubt this is the right
way to implement that. That would require much more infrastructure than
just moving the file to a separate dir.


I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough,
I guess, but maybe not.

Agreed, because the user should expect the disk space to be available
after DROP TABLE has been committed.  Can't we make COMMIT wait for the
erasure to complete?  Do we have to use an asynchronous
erasure method with a background worker?  For example, COMMIT performs:


I think it depends on how exactly it's implemented. As Tom pointed out in
his message [1], we can't do the erasure itself in the post-commit phase,
because that code is not able to handle errors. But if the files are
renamed durably, and the erasure happens in a separate process, that
could be OK. The COMMIT may wait for it or not, but that's mostly
irrelevant, I think.

[1] https://www.postgresql.org/message-id/9104.1579107235%40sss.pgh.pa.us
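And by "wait for it" I mean something as simple as blocking until the
trash directory (or queue) becomes empty. Another sketch, again with the
made-up pg_trash directory and naive polling where a real implementation
would use a latch or condition variable:

#include <dirent.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* true if the (hypothetical) trash directory contains nothing to erase */
static bool
trash_is_empty(const char *trashdir)
{
    DIR        *dir = opendir(trashdir);
    struct dirent *de;
    bool        empty = true;

    if (dir == NULL)
        return true;            /* no trash directory, nothing to wait for */

    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        empty = false;
        break;
    }
    closedir(dir);
    return empty;
}

/* block until the background worker has erased all queued files */
void
wait_for_erasure(const char *trashdir)
{
    while (!trash_is_empty(trashdir))
        sleep(1);               /* naive polling, just for illustration */
}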

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erases the file content and deletes the file. This could take a long time.
4. COMMIT replies success to the client.


I don't think the COMMIT has to wait for (3) - it might, of course, but
for some use cases it may be better to just commit and let the
bgworker do the work. And then allow checking if it completed.
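
For completeness, the per-file work the bgworker does in (3) is basically
"overwrite, fsync, unlink". A sketch with a single pass of zeroes (whether
one pass of zeroes is good enough on a particular storage stack is a
separate question):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* overwrite the file with zeroes, flush it to disk, then unlink it */
int
erase_and_unlink(const char *path)
{
    char        buf[8192];
    struct stat st;
    off_t       remaining;
    int         fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return -1;
    }

    memset(buf, 0, sizeof(buf));
    remaining = st.st_size;
    while (remaining > 0)
    {
        size_t      chunk = (remaining > (off_t) sizeof(buf)) ?
            sizeof(buf) : (size_t) remaining;

        if (write(fd, buf, chunk) != (ssize_t) chunk)
        {
            close(fd);
            return -1;
        }
        remaining -= chunk;
    }

    /* make sure the zeroes actually reach the disk before unlinking */
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    return unlink(path);
}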

What I'm concerned about is that the need to erase and delete the data
file would be forgotten if the server crashes during step 3.  If so, the
postmaster can do the job at startup, just like it deletes temporary
files (although that delays the startup).


Startup seems like a pretty bad place to do this stuff. There may be a
lot of data to erase, making recovery very long.
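
Just to be concrete about why - the startup sweep is essentially this
loop (same made-up pg_trash directory, reusing the erase_and_unlink()
sketched above), and every call has to rewrite an entire segment of up
to 1GB before the server can start accepting connections:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* erase_and_unlink() as sketched earlier: overwrite, fsync, unlink */
extern int erase_and_unlink(const char *path);

/* erase whatever was left in the trash directory after a crash */
void
erase_leftover_files(const char *trashdir)
{
    DIR        *dir = opendir(trashdir);
    struct dirent *de;
    char        path[PATH_MAX];

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(path, sizeof(path), "%s/%s", trashdir, de->d_name);
        if (erase_and_unlink(path) < 0)
            fprintf(stderr, "could not erase \"%s\"\n", path);
    }
    closedir(dir);
}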


I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on
the primary, you probably also want to erase the data on the replica.
But if the instances use different file systems (COW vs. non-COW,
...) the exact thing that needs to happen may be different. Or maybe
the replica does not need to do anything, making it a noop?

We can recommend in the manual that non-COW file systems be used on both
the primary and the standby.


I don't see how that solves the issue. I think it's quite useful to be
able to use different filesystems for primary/replica. And we may even
know how to securely erase data on both, in which case I don't see a
reason not to allow such configurations.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

