kbendick commented on issue #3541: URL: https://github.com/apache/iceberg/issues/3541#issuecomment-972575604
Ok. Well, I think I've mostly figured it out.

The contract says that drops should `CASCADE` by default, which is why the flag isn't exposed to Iceberg or other connectors. All of the tests I've found that use `CASCADE` rely on the `InMemoryTableCatalog`, which does in fact list all child namespaces and tables and drop them by default:

https://github.com/apache/spark/blob/e99fdf9654481dd9b691a3c10e52f3f3db6ed2ba/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala#L216-L224

So the contract appears to be that implementations of `dropNamespace` are expected to eagerly cascade (sketched below). I will look into this more (it's getting late for me and it's been a long day), but if there's a danger of corrupting people's data, or of accidentally letting people drop large things they didn't intend to, then that's probably something that's going to need to be discussed. I've talked to a decent number of people at several large organizations that use Spark, and it's pretty rare for very large organizations to even let their users actually purge data. Often, users want that data back within a few days, or they didn't intend to delete it at all.

So that at least accounts for how the existing tests that use `cascade` work: the `InMemoryTableCatalog` lists and drops everything itself. It seems that, for performance reasons, Spark has offloaded the work of listing and dropping onto the individual catalog implementations.

So we'll see. If it's possible to support `cascade` and still be safe for the underlying data - meaning the `RESTRICT` keyword isn't required - then I'm definitely very much in favor of it. It's too bad Spark doesn't let us see whether the flag is set, as listing is also expensive for us if each individual call is ultimately going to be rejected. But that `InMemoryTableCatalog` should provide some hints on how they have cascade working.

The reason I worry it might not be safe for us is that our implementation of `HiveCatalog` has to pass `cascade` directly to the Hive Metastore client (also sketched below):

https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L271-L284

So I don't think that call would go through Spark, which assumes it has control over it. Again, I'd love to be wrong. If you can come up with a PR that works, doesn't require people to use `RESTRICT`, and doesn't simply always cascade regardless of what the user puts, I'd 100% sign off on it. I'll also try to find time to play around and see if I can get it to work that way. Hopefully it's something we can make work, but Spark does seem to assume that all interactions with Hive go through it, and as far as I can tell that's not necessarily true for us because of our `HiveCatalog` implementation (we interact with a large number of systems, not just Spark).

This would all be a lot easier if Spark exposed whether or not the user had entered `cascade` in the query. But if we can make this work, then I'd love to get this fixed. 🙂
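To make the "eagerly cascade" contract concrete, here's a minimal, self-contained Java sketch of the pattern the `InMemoryTableCatalog` appears to follow. The class and helper names are illustrative, not Spark's or Iceberg's actual APIs; the point is only that the catalog implementation itself lists and drops children before removing the namespace, because Spark never tells it whether `CASCADE` was specified:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Spark's actual InMemoryTableCatalog code.
public class EagerCascadeSketch {
  // namespace name -> names of tables directly under it
  private final Map<String, List<String>> store = new HashMap<>();

  public boolean dropNamespace(String namespace) {
    // 1. Recurse into child namespaces first (depth-first cascade).
    List<String> children = new ArrayList<>();
    for (String ns : store.keySet()) {
      if (ns.startsWith(namespace + ".")) {
        children.add(ns);
      }
    }
    for (String child : children) {
      dropNamespace(child); // already-removed descendants just return false
    }

    // 2. Drop every table directly under this namespace.
    List<String> tables = store.get(namespace);
    if (tables == null) {
      return false; // unknown namespace, nothing to drop
    }
    tables.clear(); // stands in for per-table dropTable(ident) calls

    // 3. Only then remove the (now empty) namespace itself.
    return store.remove(namespace) != null;
  }
}
```

The cascade decision never reaches this code as a flag: a catalog that wants to honor the contract has to do the listing and dropping unconditionally.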
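For contrast, this is roughly the shape of the Hive side (see the linked `HiveCatalog` lines for the real code; treat this as a sketch rather than a verbatim copy). `IMetaStoreClient.dropDatabase` takes the cascade decision as a boolean argument, so the decision is made right at this call site and Spark never gets a say in it:

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.thrift.TException;

class HiveDropSketch {
  // Sketch: drop a database through the Hive metastore client without
  // cascading, mirroring the call in the linked HiveCatalog.dropNamespace.
  static void dropDatabaseNonCascading(IMetaStoreClient client, String database)
      throws TException {
    client.dropDatabase(
        database,
        false /* deleteData - leave the underlying files alone */,
        false /* ignoreUnknownDb - fail if the database doesn't exist */,
        false /* cascade - refuse to drop a non-empty database */);
  }
}
```

If Spark's `CASCADE` semantics have to flow all the way down to that last boolean, then Spark's assumption that it owns the cascade behavior doesn't hold for a catalog that other systems also talk to directly.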
