I'd support changing the behavior if we still have a way to match the intent, which is to return true if the table exists in Hive and is an Iceberg table.
On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > Hm I think the thread got a bit sidetracked by the other question. > > The initial proposal by Steve is a performance improvement for > HiveCatalog's tableExists(). Currently it loads both Hive and Iceberg > table metadata, and if successful returns true. The proposal is to load > from Hive only, and return true if Hive metadata identifies that an Iceberg > table exists with this name. > > Checking corruption of Iceberg's table metadata.json is a side-effect of > the current behavior, but would not anymore with the proposed change. > That's the question of the original thread, and so far there's agreement > that it is not necessarily part of this scope of HiveCatalog's > tableExists(). > > At least this is my understanding. > Thanks, > Szehon > > On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com> > wrote: > >> What kind of corruption are you referring to? I would expect corruption >> to result in an exception when loading the table, but that the table should >> still exist. The problem is likely that we determine if a table exists by >> attempting to load it. We could fix that by not attempting to load the >> table. I think that's a reasonable solution. >> >> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com> >> wrote: >> >>> The current behavior's intent is not to check whether the metadata is >>>> valid, it is to detect whether the table is an Iceberg table. >>> >>> >>> Is there a way to detect this from HiveCatalog without loading the >>> table? >>> >>> >>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com> >>> wrote: >>> >>>> I think we have an agreement, not to change the behavior wrt existing >>>> non-Iceberg tables, and throw an exception. >>>> >>>> Are we also in agreement with the original proposal to return true when >>>> the table exists but the metadata is somehow corrupted? Note: this is the >>>> proposed change of behavior why the thread was originally started. >>>> >>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote: >>>> >>>>> I'd argue against changing this. The current behavior's intent is not >>>>> to check whether the metadata is valid, it is to detect whether the table >>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior >>>>> would be surprising, especially if we started throwing exceptions. >>>>> >>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> >>>>> wrote: >>>>> >>>>>> > Should add, my personal preference is probably not to change the >>>>>> existing behavior for this part >>>>>> >>>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>>> implementation has this problem too. >>>>>> It would be good to have a test case specifically for this edge case >>>>>> and maybe call this out in the documentation. >>>>>> >>>>>> Thanks, >>>>>> Kevin Liu >>>>>> >>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Should add, my personal preference is probably not to change the >>>>>>> existing behavior for this part (false, if exists a Hive table with same >>>>>>> name) at the moment, just adding another possibility for consideration. >>>>>>> >>>>>>> Thanks >>>>>>> Szehon >>>>>>> >>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion. I guess >>>>>>>> a third option instead of returning true/false in this case, is to >>>>>>>> change >>>>>>>> it to throw an NoSuchIcebergTableException if its a non-Iceberg table, >>>>>>>> which I think is actually what this pr does? >>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>>>>> >>>>>>>>> Hey, >>>>>>>>> >>>>>>>>> I think what Kevin says makes sense. However, it would then >>>>>>>>> confuse the opposite use case of this function. Let's assume that we >>>>>>>>> change >>>>>>>>> the implementation of tableExists() to not load the table internally: >>>>>>>>> >>>>>>>>> if (tableExists(table_name)) { >>>>>>>>> table = loadTable(table_name); >>>>>>>>> } >>>>>>>>> >>>>>>>>> Here, you find that the table exists but when you try to load it >>>>>>>>> it fails because it's not an Iceberg table. I don't think that any of >>>>>>>>> these >>>>>>>>> 2 are intuitive. I think the question here is how much an API of the >>>>>>>>> Iceberg table format should know about the existence of tables in >>>>>>>>> other >>>>>>>>> formats. >>>>>>>>> >>>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>>>> HMS >>>>>>>>> >>>>>>>>> Another interpretation of calling Catalog.tableExists() on an >>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not >>>>>>>>> sure if >>>>>>>>> any of the 2 approaches are better than the other, I just wanted to >>>>>>>>> show >>>>>>>>> that there is another side of the coin :) >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Gabor >>>>>>>>> >>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Steve, >>>>>>>>>> >>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog, >>>>>>>>>> checking the HMS entry should be sufficient. >>>>>>>>>> >>>>>>>>>> I do have a question about usage, though. Typically, I would use >>>>>>>>>> `tableExists` like this: >>>>>>>>>> >>>>>>>>>> ``` >>>>>>>>>> if (!tableExists(table_name)) { >>>>>>>>>> table = createTable(table_name); >>>>>>>>>> } >>>>>>>>>> ``` >>>>>>>>>> What happens when a Hive table with the same name already exists >>>>>>>>>> in the catalog? In the current implementation, `tableExists` would >>>>>>>>>> return >>>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a >>>>>>>>>> `NoSuchTableException`. >>>>>>>>>> This would cause the code above to attempt to create the table, >>>>>>>>>> only to fail since the name already exists in the HMS. >>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>>>> HMS, perhaps it should return true even when a Hive table with the >>>>>>>>>> same >>>>>>>>>> name exists. >>>>>>>>>> >>>>>>>>>> I’d love to hear your thoughts on this. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Kevin Liu >>>>>>>>>> >>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho < >>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> It's a good performance find and improvement. Left some >>>>>>>>>>> comment on the PR. >>>>>>>>>>> >>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm >>>>>>>>>>> supportive of it. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Szehon >>>>>>>>>>> >>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Iceberger, >>>>>>>>>>>> >>>>>>>>>>>> I have a proposal to simplify the tableExists API in the >>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to >>>>>>>>>>>> hear your >>>>>>>>>>>> thoughts. >>>>>>>>>>>> >>>>>>>>>>>> Currently, in our catalog interface[1], the tableExists method >>>>>>>>>>>> is implemented as a default API by invoking the loadTable method. >>>>>>>>>>>> It returns true if the table can be loaded without exceptions. >>>>>>>>>>>> This behavior implies two checks: >>>>>>>>>>>> >>>>>>>>>>>> 1. The table entry exists in the catalog. >>>>>>>>>>>> 2. The latest metadata.json for the table is not corrupted. >>>>>>>>>>>> >>>>>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>>>>> condition—checking if the table entry exists in the catalog. This >>>>>>>>>>>> separates >>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata >>>>>>>>>>>> not >>>>>>>>>>>> corrupted). Such a change could improve the performance of >>>>>>>>>>>> existence >>>>>>>>>>>> checks, especially for RESTcatalog where table existence is >>>>>>>>>>>> abstracted as >>>>>>>>>>>> an HTTP HEAD request [2]. >>>>>>>>>>>> >>>>>>>>>>>> I also reviewed the current usage of the tableExists API in >>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not >>>>>>>>>>>> have any >>>>>>>>>>>> negative impact. >>>>>>>>>>>> >>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>>>>> >>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>>>>> >>>>>>>>>>>> [2]: >>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Best regards, >>>>>>>>>>>> Steve Zhang >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>