jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805296518
##########
File path:
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident,
TableChange... changes) throws No
@Override
public boolean dropTable(Identifier ident) {
+ return dropTableInternal(ident, false);
+ }
+
+ @Override
+ public boolean purgeTable(Identifier ident) {
+ return dropTableInternal(ident, true);
+ }
+
+ private boolean dropTableInternal(Identifier ident, boolean purge) {
try {
return isPathIdentifier(ident) ?
- tables.dropTable(((PathIdentifier) ident).location()) :
- icebergCatalog.dropTable(buildIdentifier(ident));
+ tables.dropTable(((PathIdentifier) ident).location(), purge) :
Review comment:
Thank you Ryan very much for the insights!
For the GC discussion, I am fine with disallowing purge when `gc.enabled` is
set to false as a short-term solution, if that is how we would like to define
garbage collection. We should enforce it within the catalog implementations so
that the behavior is consistent across engines.
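A minimal sketch of what such a catalog-side check could look like. `PurgeGuard`
and `checkPurgeAllowed` are hypothetical names for illustration, not existing
Iceberg APIs, and treating a missing `gc.enabled` property as true is an
assumption of this sketch:

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg: one place a catalog implementation
// could reject a purge before deleting any files.
final class PurgeGuard {
  // Table property discussed above.
  static final String GC_ENABLED = "gc.enabled";

  private PurgeGuard() {
  }

  // Assumption: a table without the property is treated as gc.enabled=true.
  static boolean gcEnabled(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(tableProperties.getOrDefault(GC_ENABLED, "true"));
  }

  // Throws if a purge is requested for a table whose files must not be removed.
  // A plain drop (purge=false) is always allowed by this check.
  static void checkPurgeAllowed(Map<String, String> tableProperties, boolean purge) {
    if (purge && !gcEnabled(tableProperties)) {
      throw new IllegalStateException(
          "Cannot purge table: " + GC_ENABLED + " is false");
    }
  }
}
```

If each catalog called a check like this before purging, Spark and other
engines would get the same answer without repeating the logic on the engine
side.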
> I really don't think that the concept of EXTERNAL has a place in Iceberg
> and I am very skeptical that we should add it.
100% agree. However, we have now opened the door by adding `registerTable` to
the catalog, which maps to the external table concept perfectly. I have already
received a few feature requests asking for this to map to CREATE EXTERNAL
TABLE. People can now register an Iceberg table with an old metadata file
location and write against it, creating two diverged metadata histories of the
same table. This is very dangerous, because two Iceberg tables can now own the
same set of files and corrupt each other.
As a first step, we should make it clear in the javadoc of `registerTable`
that it is only meant for recovering a table. Creating two references in the
catalog to the same table metadata and writing through both, which produces
diverged metadata histories, is not recommended and will have unintended side
effects.
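To make the divergence concrete, here is a toy model. It uses no Iceberg
classes; `DivergenceDemo`, `TableRef`, and `register` are made-up names, and a
metadata history is reduced to a list of snapshot labels:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: illustrates two catalog references that start from the same
// metadata and then commit independently.
final class DivergenceDemo {
  private DivergenceDemo() {
  }

  // A catalog entry that tracks its own copy of the metadata history.
  static final class TableRef {
    final List<String> history = new ArrayList<>();

    TableRef(List<String> startingHistory) {
      history.addAll(startingHistory);
    }

    // Each commit only advances this reference; the other one never sees it.
    void commit(String snapshot) {
      history.add(snapshot);
    }

    String current() {
      return history.get(history.size() - 1);
    }
  }

  // Stand-in for registering a table from an existing metadata file.
  static TableRef register(List<String> metadataHistory) {
    return new TableRef(metadataHistory);
  }
}
```

Registering the same history twice and committing once through each reference
leaves both with the same prefix but different current snapshots, while both
still believe they own the same underlying files.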
When I was suggesting to add the `EXTERNAL` concept to `registerTable`, to
be honest I was really trying to make peace with people who want to stick with
this definition. At least we can have a read only solution and only encourage
registering a normal table for recovery. But the more I think of this topic,
the more I feel we should start to promote the right way to operate against
Iceberg tables.
Just from a correctness perspective, this is the wrong thing to promote. Even
if people just want to read a historical metadata file, information like table
properties is stale. It is always better to do time travel against the latest
metadata, or run a query against a tagged snapshot recorded in the latest
metadata, and have the metadata location automatically updated with new commits.
As you said, `EXTERNAL` also just arbitrarily limits people's ability to use
Iceberg. I would say its only value is at the business level. For compute
vendors or data platforms that have the traditional definition of external and
managed tables, it is much easier to provide external table support for an
alien product than the full table operation support that is usually offered to
managed tables only. My team debated this for a long time before we decided to
go with full support, and we tried really hard to explain the entire picture,
but I doubt other people could do the same. We can already see some vendors
offering Iceberg support under the name of "external table".
In the long term, we should start to promote a new table ownership model
(maybe call it a `SHARED` model) and start to bring people up to date with how
Iceberg tables are operated. Let me draft a doc for that to have a formal
discussion, and also include concepts like table root location ownership in
that doc so we can have full clarity in the domain of table ownership.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]