rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805401777
##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :
Review comment:
> 100% Agree. However we now opened the door by adding registerTable in
> catalog, which maps to the external table concept perfectly. I already received
> a few feature requests of people asking for this to map to CREATE EXTERNAL
> TABLE. People can now register an Iceberg table with an old metadata file
> location and do writes against it to create basically 2 diverged metadata
> history of the same table. This is very dangerous action because 2 Iceberg
> tables can now own the same set of files and corrupt each other.

I'm not sure I agree that it maps perfectly. This is a way to register a
table with a catalog, after which the catalog owns it like any other table.
There should be nothing that suggests registration has anything to do with
`EXTERNAL` and no reason for people to think that tables that are added to a
catalog through `registerTable` should behave any differently.

If this confusion persists, I would support removing `registerTable` from
the API.
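
The divergence concern in the quoted text above can be illustrated with a small sketch. This is a toy model, not Iceberg's API: the class `DivergedHistorySketch`, its methods, and the file names are all hypothetical, and a "table" here is only the list of data files that its current metadata points to. It assumes that registering from an old metadata location gives the new table its own independent history from that point on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Iceberg's API: a "table" is just the list of
// data files that its current metadata points to.
class DivergedHistorySketch {
  private final List<String> files;

  // Building a table from an existing metadata state copies that state,
  // so every registration evolves its own history afterwards.
  DivergedHistorySketch(List<String> metadataState) {
    this.files = new ArrayList<>(metadataState);
  }

  void append(String file) {
    files.add(file);
  }

  List<String> files() {
    return new ArrayList<>(files);
  }

  // Files referenced by both tables, i.e. files with two "owners".
  static Set<String> sharedFiles(DivergedHistorySketch a, DivergedHistorySketch b) {
    Set<String> shared = new HashSet<>(a.files);
    shared.retainAll(b.files);
    return shared;
  }

  public static void main(String[] args) {
    List<String> oldMetadata = List.of("data-1.parquet");

    // The original table keeps committing after the old metadata was written.
    DivergedHistorySketch original = new DivergedHistorySketch(oldMetadata);
    original.append("data-2.parquet");

    // A second table is registered from the same old metadata and written to.
    DivergedHistorySketch registered = new DivergedHistorySketch(oldMetadata);
    registered.append("data-3.parquet");

    System.out.println(original.files());   // [data-1.parquet, data-2.parquet]
    System.out.println(registered.files()); // [data-1.parquet, data-3.parquet]
    System.out.println(sharedFiles(original, registered)); // [data-1.parquet]
  }
}
```

In this model, any cleanup that deletes `data-1.parquet` on behalf of one table would break the other, which is the corruption risk described in the quote.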

> Just from correctness perspective, this is the wrong thing to promote.

Agreed!

> In the long term, we should start to promote a new table ownership model
> (maybe call it a SHARED model) and start to bring people up to date with how
> Iceberg tables are operated. Let me draft a doc for that to have a formal
> discussion, and also include concepts like table root location ownership in
> that doc so we can have full clarity in the domain of table ownership.

I'm not sure that I would want a `SHARED` keyword -- that just implies there
are times when the table is not shared and we would get into similar trouble.
But I think your idea to address this in a design doc is good.

Also, I consider the data/file ownership a separate problem, so you may want
to keep them separate in design docs or proposals. I wouldn't want to confuse
table modification with data file ownership, although modification does have
implications for file ownership.

> I think if we change the behavior of drop table to not drop any data that
> alleviates our concern on accidental drops on external tables. However, it also
> means that drop table on managed tables would leave data around, which is also
> an issue.

This is why Iceberg ignores `EXTERNAL`. The platform should be making these
decisions, ideally. Users interact with logical tables; physical concerns are
for the platform. If you don't have a platform-level plan for dropping table
data, then I think the `PURGE` approach is okay because a user presumably makes
the choice at the right time (rather than months, if not years, before the drop).

My general recommendation is to tell users that they're logically dropping
tables and data. Maybe you can have a platform-supported way to un-delete, but
when you drop a table, you should generally expect that you did something
destructive!
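
To make the drop-versus-purge distinction concrete, here is a minimal sketch of the semantics under discussion. It is not Iceberg's or Spark's implementation: `DropVsPurgeSketch` and all of its methods are hypothetical, and the "catalog" and "storage" are plain in-memory collections. It only mirrors the shape of `dropTableInternal(ident, purge)` from the diff, assuming that a drop without purge removes the catalog entry and leaves files in place.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a catalog plus a file store;
// none of these names come from Iceberg or Spark.
class DropVsPurgeSketch {
  // Table name -> data file paths referenced by that table.
  private final Map<String, Set<String>> tables = new HashMap<>();
  // Every file that currently exists in "storage".
  private final Set<String> storage = new HashSet<>();

  void createTable(String name, String... files) {
    Set<String> owned = new HashSet<>();
    for (String file : files) {
      owned.add(file);
      storage.add(file);
    }
    tables.put(name, owned);
  }

  // Same shape as dropTableInternal(ident, purge) in the diff:
  // the catalog entry always goes away, the data only when purge is set.
  boolean dropTable(String name, boolean purge) {
    Set<String> files = tables.remove(name);
    if (files == null) {
      return false; // no such table, nothing was dropped
    }
    if (purge) {
      storage.removeAll(files); // the user chose data removal at drop time
    }
    return true;
  }

  boolean tableExists(String name) {
    return tables.containsKey(name);
  }

  boolean fileExists(String path) {
    return storage.contains(path);
  }

  public static void main(String[] args) {
    DropVsPurgeSketch catalog = new DropVsPurgeSketch();
    catalog.createTable("db.a", "a/data-1.parquet");
    catalog.createTable("db.b", "b/data-1.parquet");

    catalog.dropTable("db.a", false); // plain drop
    catalog.dropTable("db.b", true);  // drop with purge

    System.out.println(catalog.fileExists("a/data-1.parquet")); // true
    System.out.println(catalog.fileExists("b/data-1.parquet")); // false
  }
}
```

The point of the sketch is that the purge decision is made at drop time, by the user issuing the drop, rather than being fixed when the table was created.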
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]