kbendick commented on issue #3541:
URL: https://github.com/apache/iceberg/issues/3541#issuecomment-972575604


   Ok. Well, I think I've mostly figured it out.
   
   So the contract says that drops should `CASCADE` by default, which is why 
the flag isn't exposed to Iceberg or other connectors.
   
   All of the tests that I've found which use `CASCADE` use the 
`InMemoryTableCatalog`. The `InMemoryTableCatalog` does in fact list all 
tables and child namespaces and drop them by default.
   
https://github.com/apache/spark/blob/e99fdf9654481dd9b691a3c10e52f3f3db6ed2ba/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala#L216-L224
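   The pattern `InMemoryTableCatalog` follows can be sketched without any Spark dependencies. The class and method names below are illustrative stand-ins, not Spark's actual interfaces; the point is just that `dropNamespace` eagerly lists everything under the namespace and drops it, so the cascade happens inside the catalog implementation:

   ```java
   import java.util.*;
   import java.util.stream.*;

   // Dependency-free sketch of the eager-cascade pattern: dropNamespace lists
   // every table under the namespace and drops it. Names are illustrative,
   // not Spark's real classes.
   class ToyCatalog {
       // table identifiers stored flat as "namespace.table"
       private final Set<String> tables = new HashSet<>();

       void createTable(String ident) { tables.add(ident); }

       boolean dropNamespace(String namespace) {
           String prefix = namespace + ".";
           // Eagerly cascade: collect everything under the namespace, then drop it all.
           List<String> toDrop = tables.stream()
               .filter(t -> t.startsWith(prefix))
               .collect(Collectors.toList());
           toDrop.forEach(tables::remove);
           return !toDrop.isEmpty();
       }

       Set<String> listTables() { return tables; }
   }
   ```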
   
   It seems the contract is that implementations of `dropNamespace` are 
supposed to eagerly cascade. I will look into this more (as it's getting late 
for me and it's been a long day), but if there is a risk of corrupting 
people's data, or of letting people accidentally drop large amounts of data 
they didn't intend to, then that's probably something that's going to need to 
be discussed.
   
   I've talked to a decent number of people at several different large 
organizations that use Spark, and it's pretty rare for very large organizations 
to even let their users actually purge data, as users often want that data 
back within a few days, or didn't intend to delete it at all.
   
   So that at least accounts for how the existing tests that use `cascade` 
work: via that `InMemoryTableCatalog`, which does list and drop everything. It 
seems that, for performance reasons, Spark has offloaded the work of listing 
and dropping to the individual catalog implementations.
   
   So we'll see. If it's possible to support `cascade` and still be safe for 
the underlying data (meaning the `RESTRICT` keyword isn't required), then I'm 
definitely very much in favor of it.
   
   It's too bad Spark doesn't expose whether the flag is set or not, as 
listing is also expensive for us if each individual drop call is ultimately 
going to be rejected.
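   To make the wish concrete, here is a purely hypothetical sketch of the kind of API this comment is asking for: a `dropNamespace` overload that carries the user's `CASCADE` flag through to the catalog, so a non-empty namespace can be rejected cheaply instead of the catalog listing everything only to fail. Nothing below is Spark's actual interface:

   ```java
   import java.util.*;

   // Hypothetical cascade-aware catalog interface (not Spark's real API).
   interface CascadeAwareCatalog {
       // cascade is true only when the user wrote DROP NAMESPACE ... CASCADE
       boolean dropNamespace(String namespace, boolean cascade);
   }

   class SafeCatalog implements CascadeAwareCatalog {
       private final Map<String, List<String>> tablesByNamespace = new HashMap<>();

       void createTable(String namespace, String table) {
           tablesByNamespace.computeIfAbsent(namespace, k -> new ArrayList<>()).add(table);
       }

       public boolean dropNamespace(String namespace, boolean cascade) {
           List<String> tables = tablesByNamespace.getOrDefault(namespace, List.of());
           if (!tables.isEmpty() && !cascade) {
               // Cheap rejection: no per-table drops, no expensive listing round trips.
               throw new IllegalStateException("Namespace not empty; use CASCADE");
           }
           tablesByNamespace.remove(namespace);
           return true;
       }
   }
   ```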
   
   But that `InMemoryTableCatalog` should provide some hints on how they have 
cascade working.
   
   The reason I worry it might not be safe for us is that, in our 
implementation of `HiveCatalog`, we have to pass `cascade` directly to the Hive 
Metastore Client: 
https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L271-L284
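   The concern can be sketched without Hive or Iceberg on the classpath. The interface below is a stand-in modeled loosely on Hive's `dropDatabase(name, deleteData, ignoreUnknownDb, cascade)` client call; the point is that the cascade decision is handed straight to the metastore, where Spark never sees or controls it:

   ```java
   // Dependency-free sketch: the catalog delegates the cascade flag to the
   // metastore client, bypassing Spark entirely. The interface is a stand-in,
   // not Hive's actual IMetaStoreClient.
   interface MetastoreClient {
       void dropDatabase(String name, boolean deleteData, boolean ignoreUnknownDb, boolean cascade);
   }

   class RecordingClient implements MetastoreClient {
       boolean lastCascade;
       public void dropDatabase(String name, boolean deleteData, boolean ignoreUnknownDb, boolean cascade) {
           lastCascade = cascade; // the metastore, not Spark, enforces this flag
       }
   }

   class ToyHiveCatalog {
       private final MetastoreClient client;
       ToyHiveCatalog(MetastoreClient client) { this.client = client; }

       boolean dropNamespace(String namespace) {
           // cascade=false here means the metastore itself rejects non-empty
           // databases, which is behavior Spark cannot observe or override.
           client.dropDatabase(namespace, false, false, false);
           return true;
       }
   }
   ```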
   
   So I don't think that drop would go through Spark, which assumes it has 
control over the operation.
   
   Again, I'd love to be wrong. So if you can come up with a PR that works, 
doesn't require people to use `RESTRICT`, and doesn't simply always cascade 
regardless of what the user specifies, I'd 100% sign off on that.
   
   I'll also try to find time to play around and see if I can get it to work 
that way. Hopefully it's something we can make work, but Spark does seem to 
assume that all interactions with Hive go through it, and as far as I can tell 
that's not necessarily true for us, because of our implementation of 
`HiveCatalog` (we interact with a large number of systems, not just Spark).
   
   This would all be a lot easier if Spark exposed whether or not the user had 
entered `cascade` into the query for us. But if we can make this work, then I'd 
love to get this fixed. 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
