kbendick commented on issue #3541: URL: https://github.com/apache/iceberg/issues/3541#issuecomment-972575604
Ok. Well, I think I've mostly figured it out.

The contract says that drops should `CASCADE` by default, which is why the flag isn't exposed to Iceberg or other connectors. All of the tests I've found that use `CASCADE` rely on the `InMemoryTableCatalog`, which does in fact list all child namespaces and tables and drop them by default:

https://github.com/apache/spark/blob/e99fdf9654481dd9b691a3c10e52f3f3db6ed2ba/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala#L216-L224

So the contract appears to be that implementations of `dropNamespace` are expected to eagerly cascade (sketched below). I will look into this more (it's getting late for me and it's been a long day), but if there's a danger of corrupting people's data, or of accidentally letting people drop large things they didn't intend to, then that's probably something that's going to need to be discussed. I've talked to a decent number of people at several large organizations that use Spark, and it's pretty rare for very large organizations to even let their users actually purge data. Often, users want that data back within a few days, or they didn't intend to delete it at all.

So that at least accounts for how the existing tests that use `cascade` work: the `InMemoryTableCatalog` lists and drops everything itself. It seems that, for performance reasons, Spark has offloaded the work of listing and dropping onto the individual catalog implementations.

So we'll see. If it's possible to support `cascade` and still be safe for the underlying data - meaning the `RESTRICT` keyword isn't required - then I'm definitely very much in favor of it. It's too bad Spark doesn't let us see whether the flag is set, as listing is also expensive for us if each individual call is ultimately going to be rejected. But that `InMemoryTableCatalog` should provide some hints on how they have cascade working.

The reason I worry it might not be safe for us is that our implementation of `HiveCatalog` has to pass `cascade` directly to the Hive Metastore client (also sketched below):

https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L271-L284

So I don't think that call would go through Spark, which assumes it has control over it. Again, I'd love to be wrong. If you can come up with a PR that works, doesn't require people to use `RESTRICT`, and doesn't simply always cascade regardless of what the user puts, I'd 100% sign off on it. I'll also try to find time to play around and see if I can get it to work that way. Hopefully it's something we can make work, but Spark does seem to assume that all interactions with Hive go through it, and as far as I can tell that's not necessarily true for us because of our `HiveCatalog` implementation (we interact with a large number of systems, not just Spark).

This would all be a lot easier if Spark exposed whether or not the user had entered `cascade` in the query. But if we can make this work, then I'd love to get this fixed. 🙂
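To make the "eagerly cascade" contract concrete, here's a minimal, self-contained Java sketch of the pattern the `InMemoryTableCatalog` appears to follow. The class and helper names are illustrative, not Spark's or Iceberg's actual APIs; the point is only that the catalog implementation itself lists and drops children before removing the namespace, because Spark never tells it whether `CASCADE` was specified:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Spark's actual InMemoryTableCatalog code.
public class EagerCascadeSketch {
  // namespace name -> names of tables directly under it
  private final Map<String, List<String>> store = new HashMap<>();

  public boolean dropNamespace(String namespace) {
    // 1. Recurse into child namespaces first (depth-first cascade).
    List<String> children = new ArrayList<>();
    for (String ns : store.keySet()) {
      if (ns.startsWith(namespace + ".")) {
        children.add(ns);
      }
    }
    for (String child : children) {
      dropNamespace(child); // already-removed descendants just return false
    }

    // 2. Drop every table directly under this namespace.
    List<String> tables = store.get(namespace);
    if (tables == null) {
      return false; // unknown namespace, nothing to drop
    }
    tables.clear(); // stands in for per-table dropTable(ident) calls

    // 3. Only then remove the (now empty) namespace itself.
    return store.remove(namespace) != null;
  }
}
```

The cascade decision never reaches this code as a flag: a catalog that wants to honor the contract has to do the listing and dropping unconditionally.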
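For contrast, this is roughly the shape of the Hive side (see the linked `HiveCatalog` lines for the real code; treat this as a sketch rather than a verbatim copy). `IMetaStoreClient.dropDatabase` takes the cascade decision as a boolean argument, so the decision is made right at this call site and Spark never gets a say in it:

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.thrift.TException;

class HiveDropSketch {
  // Sketch: drop a database through the Hive metastore client without
  // cascading, mirroring the call in the linked HiveCatalog.dropNamespace.
  static void dropDatabaseNonCascading(IMetaStoreClient client, String database)
      throws TException {
    client.dropDatabase(
        database,
        false /* deleteData - leave the underlying files alone */,
        false /* ignoreUnknownDb - fail if the database doesn't exist */,
        false /* cascade - refuse to drop a non-empty database */);
  }
}
```

If Spark's `CASCADE` semantics have to flow all the way down to that last boolean, then Spark's assumption that it owns the cascade behavior doesn't hold for a catalog that other systems also talk to directly.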
