marton-bod commented on pull request #2367: URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806687999
@aokolnychyi Thanks for your questions, Anton. > How does Hive populate deleteData in commitDropTable? `deleteData` is set to true by default here: [Hive.java](https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L965). This ensures data is automatically deleted for managed tables. However, data is not deleted for external tables pre-Hive4: [HiveMetaStore.java](https://github.com/apache/hive/blob/branch-3.1/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2543), Introduced in Hive4, we can now force the deletion even for external tables with the `external.table.purge=TRUE` property: [Docs](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables) > Is it correct that external.table.purge is not specific to Iceberg, it used in other tables too? Correct, it's used for Hive tables as well. So at the moment, all Hive Iceberg tables are external, therefore their data is not dropped automatically by Hive2 and Hive3 (Hive4 does so with the purge flag). Currently our solution for that is to check for the `external.table.purge` property in the `HiveIcebergMetaHook`, and if set, call `Catalogs.dropTable/CatalogUtil.dropTableData` to delete the data ourselves in the postHook. A Hive user should be able to decide whether or not data deletion is desirable for a table by changing the table property `external.table.purge`. Going forward, some of the options I see to avoid surprises: - Synchronizing `external.table.purge` with `gc.enabled`. When one changes, the other is changed accordingly. This could already work when changing the `gc.enabled` (and then altering the purge prop in HMS accordingly), but not the other way around (since `ALTER TABLE SET TBLPROPERTIES` is not implemented yet to push down HMS prop changes to Iceberg - although this is on our roadmap already). - Adding the `gc.enabled` prop to the HMS table props as well, make it control the data deletion for iceberg tables via the MetaHook (+ update docs to let users know of this) and let users change this prop via ALTER TABLE as desired. I don't really like this one, because it introduces an inconsistency between Iceberg and Hive tables. - Leaving things as is, but updating the Hive docs so that users are aware that DROP TABLE for Iceberg tables might behave differently depending on Iceberg configuration, and improve logging. I'm no big fan of this either. At first glance, I think I'd prefer the first option. Any thoughts on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
