marton-bod commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806687999


   @aokolnychyi Thanks for your questions, Anton.
   
   > How does Hive populate deleteData in commitDropTable?
   
   `deleteData` is set to true by default here: 
[Hive.java](https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L965).
 This ensures data is automatically deleted for managed tables.
   However, data is not deleted for external tables pre-Hive4: 
[HiveMetaStore.java](https://github.com/apache/hive/blob/branch-3.1/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2543),
 
   Introduced in Hive4, we can now force the deletion even for external tables 
with the `external.table.purge=TRUE` property: 
[Docs](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables)
   
   > Is it correct that external.table.purge is not specific to Iceberg, it 
used in other tables too?
   
   Correct, it's used for Hive tables as well.
   
   So at the moment, all Hive Iceberg tables are external, therefore their data 
is not dropped automatically by Hive2 and Hive3 (Hive4 does so with the purge 
flag). Currently our solution for that is to check for the 
`external.table.purge` property in the `HiveIcebergMetaHook`, and if set, call 
`Catalogs.dropTable/CatalogUtil.dropTableData` to delete the data ourselves in 
the postHook.
   
    A Hive user should be able to decide whether or not data deletion is 
desirable for a table by changing the table property `external.table.purge`. 
Going forward, some of the options I see to avoid surprises:
   
   - Synchronizing `external.table.purge` with `gc.enabled`. When one changes, 
the other is changed accordingly. This could already work when changing the 
`gc.enabled` (and then altering the purge prop in HMS accordingly), but not the 
other way around (since `ALTER TABLE SET TBLPROPERTIES` is not implemented yet 
to push down HMS prop changes to Iceberg - although this is on our roadmap 
already). 
   - Adding the `gc.enabled` prop to the HMS table props as well, make it 
control the data deletion for iceberg tables via the MetaHook (+ update docs to 
let users know of this) and let users change this prop via ALTER TABLE as 
desired. I don't really like this one, because it introduces an inconsistency 
between Iceberg and Hive tables.
   - Leaving things as is, but updating the Hive docs so that users are aware 
that DROP TABLE for Iceberg tables might behave differently depending on 
Iceberg configuration, and improve logging. I'm no big fan of this either.
   
   At first glance, I think I'd prefer the first option. Any thoughts on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to