What kind of corruption are you referring to? I would expect corruption to result in an exception when loading the table, but that the table should still exist. The problem is likely that we determine if a table exists by attempting to load it. We could fix that by not attempting to load the table. I think that's a reasonable solution.
On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com> wrote: > The current behavior's intent is not to check whether the metadata is >> valid, it is to detect whether the table is an Iceberg table. > > > Is there a way to detect this from HiveCatalog without loading the table? > > > On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com> > wrote: > >> I think we have an agreement, not to change the behavior wrt existing >> non-Iceberg tables, and throw an exception. >> >> Are we also in agreement with the original proposal to return true when >> the table exists but the metadata is somehow corrupted? Note: this is the >> proposed change of behavior why the thread was originally started. >> >> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote: >> >>> I'd argue against changing this. The current behavior's intent is not to >>> check whether the metadata is valid, it is to detect whether the table is >>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior >>> would be surprising, especially if we started throwing exceptions. >>> >>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> wrote: >>> >>>> > Should add, my personal preference is probably not to change the >>>> existing behavior for this part >>>> >>>> +1. I realized that this is not a new behavior. The `loadTable` >>>> implementation has this problem too. >>>> It would be good to have a test case specifically for this edge case >>>> and maybe call this out in the documentation. >>>> >>>> Thanks, >>>> Kevin Liu >>>> >>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> >>>>> Should add, my personal preference is probably not to change the >>>>> existing behavior for this part (false, if exists a Hive table with same >>>>> name) at the moment, just adding another possibility for consideration. >>>>> >>>>> Thanks >>>>> Szehon >>>>> >>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> >>>>> wrote: >>>>> >>>>>> Thanks Kevin and Gabor, this is an interesting discussion. I guess a >>>>>> third option instead of returning true/false in this case, is to change >>>>>> it >>>>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, >>>>>> which I >>>>>> think is actually what this pr does? >>>>>> >>>>>> Thanks >>>>>> Szehon >>>>>> >>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>>> >>>>>>> Hey, >>>>>>> >>>>>>> I think what Kevin says makes sense. However, it would then confuse >>>>>>> the opposite use case of this function. Let's assume that we change the >>>>>>> implementation of tableExists() to not load the table internally: >>>>>>> >>>>>>> if (tableExists(table_name)) { >>>>>>> table = loadTable(table_name); >>>>>>> } >>>>>>> >>>>>>> Here, you find that the table exists but when you try to load it it >>>>>>> fails because it's not an Iceberg table. I don't think that any of >>>>>>> these 2 >>>>>>> are intuitive. I think the question here is how much an API of the >>>>>>> Iceberg >>>>>>> table format should know about the existence of tables in other formats. >>>>>>> >>>>>>> If `tableExists` is meant to check for conflicting entries in the HMS >>>>>>> >>>>>>> Another interpretation of calling Catalog.tableExists() on an >>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not sure >>>>>>> if >>>>>>> any of the 2 approaches are better than the other, I just wanted to show >>>>>>> that there is another side of the coin :) >>>>>>> >>>>>>> Regards, >>>>>>> Gabor >>>>>>> >>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Steve, >>>>>>>> >>>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>>> whether a table's name exists in the catalog. For the Hive catalog, >>>>>>>> checking the HMS entry should be sufficient. >>>>>>>> >>>>>>>> I do have a question about usage, though. Typically, I would use ` >>>>>>>> tableExists` like this: >>>>>>>> >>>>>>>> ``` >>>>>>>> if (!tableExists(table_name)) { >>>>>>>> table = createTable(table_name); >>>>>>>> } >>>>>>>> ``` >>>>>>>> What happens when a Hive table with the same name already exists in >>>>>>>> the catalog? In the current implementation, `tableExists` would return >>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a >>>>>>>> `NoSuchTableException`. >>>>>>>> This would cause the code above to attempt to create the table, >>>>>>>> only to fail since the name already exists in the HMS. >>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>> HMS, perhaps it should return true even when a Hive table with the same >>>>>>>> name exists. >>>>>>>> >>>>>>>> I’d love to hear your thoughts on this. >>>>>>>> >>>>>>>> Best, >>>>>>>> Kevin Liu >>>>>>>> >>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> It's a good performance find and improvement. Left some comment >>>>>>>>> on the PR. >>>>>>>>> >>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm >>>>>>>>> supportive of it. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hi Iceberger, >>>>>>>>>> >>>>>>>>>> I have a proposal to simplify the tableExists API in the Hive >>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your >>>>>>>>>> thoughts. >>>>>>>>>> >>>>>>>>>> Currently, in our catalog interface[1], the tableExists method >>>>>>>>>> is implemented as a default API by invoking the loadTable method. >>>>>>>>>> It returns true if the table can be loaded without exceptions. >>>>>>>>>> This behavior implies two checks: >>>>>>>>>> >>>>>>>>>> 1. The table entry exists in the catalog. >>>>>>>>>> 2. The latest metadata.json for the table is not corrupted. >>>>>>>>>> >>>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>>> condition—checking if the table entry exists in the catalog. This >>>>>>>>>> separates >>>>>>>>>> the concerns of table existence and table health (e.g., metadata not >>>>>>>>>> corrupted). Such a change could improve the performance of existence >>>>>>>>>> checks, especially for RESTcatalog where table existence is >>>>>>>>>> abstracted as >>>>>>>>>> an HTTP HEAD request [2]. >>>>>>>>>> >>>>>>>>>> I also reviewed the current usage of the tableExists API in the >>>>>>>>>> Iceberg codebase to ensure that this optimization would not have any >>>>>>>>>> negative impact. >>>>>>>>>> >>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>>> >>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>>> >>>>>>>>>> [2]: >>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Best regards, >>>>>>>>>> Steve Zhang >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>