Thanks Kevin and Gabor, this is an interesting discussion. I guess a third option instead of returning true/false in this case, is to change it to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I think is actually what this pr does?
Thanks Szehon On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab <gaborkas...@cloudera.com.invalid> wrote: > Hey, > > I think what Kevin says makes sense. However, it would then confuse the > opposite use case of this function. Let's assume that we change the > implementation of tableExists() to not load the table internally: > > if (tableExists(table_name)) { > table = loadTable(table_name); > } > > Here, you find that the table exists but when you try to load it it fails > because it's not an Iceberg table. I don't think that any of these 2 are > intuitive. I think the question here is how much an API of the Iceberg > table format should know about the existence of tables in other formats. > > If `tableExists` is meant to check for conflicting entries in the HMS > > Another interpretation of calling Catalog.tableExists() on an Iceberg API > is instead "is there such an Iceberg table". TBH, not sure if any of the 2 > approaches are better than the other, I just wanted to show that there is > another side of the coin :) > > Regards, > Gabor > > On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> wrote: > >> Hi Steve, >> >> This makes sense to me. The semantics of `tableExists` focus on whether a >> table's name exists in the catalog. For the Hive catalog, checking the HMS >> entry should be sufficient. >> >> I do have a question about usage, though. Typically, I would use ` >> tableExists` like this: >> >> ``` >> if (!tableExists(table_name)) { >> table = createTable(table_name); >> } >> ``` >> What happens when a Hive table with the same name already exists in the >> catalog? In the current implementation, `tableExists` would return `false` >> because `HiveOperationsBase.validateTableIsIceberg` throws a >> `NoSuchTableException`. >> This would cause the code above to attempt to create the table, only to >> fail since the name already exists in the HMS. >> If `tableExists` is meant to check for conflicting entries in the HMS, >> perhaps it should return true even when a Hive table with the same name >> exists. >> >> I’d love to hear your thoughts on this. >> >> Best, >> Kevin Liu >> >> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> >> wrote: >> >>> Hi, >>> >>> It's a good performance find and improvement. Left some comment on the >>> PR. >>> >>> IMO, the behavior actually more matches the API javadoc ("Check whether >>> table exists"), not whether it is corrupted or not, so I'm supportive of it. >>> >>> Thanks >>> Szehon >>> >>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>> <hongyue_zh...@apple.com.invalid> wrote: >>> >>>> Hi Iceberger, >>>> >>>> I have a proposal to simplify the tableExists API in the Hive >>>> catalog, which involves a behavior change, and I’d like to hear your >>>> thoughts. >>>> >>>> Currently, in our catalog interface[1], the tableExists method is >>>> implemented as a default API by invoking the loadTable method. It >>>> returns true if the table can be loaded without exceptions. This >>>> behavior implies two checks: >>>> >>>> 1. The table entry exists in the catalog. >>>> 2. The latest metadata.json for the table is not corrupted. >>>> >>>> The behavior change I’m proposing focuses only on the first >>>> condition—checking if the table entry exists in the catalog. This separates >>>> the concerns of table existence and table health (e.g., metadata not >>>> corrupted). Such a change could improve the performance of existence >>>> checks, especially for RESTcatalog where table existence is abstracted as >>>> an HTTP HEAD request [2]. >>>> >>>> I also reviewed the current usage of the tableExists API in the >>>> Iceberg codebase to ensure that this optimization would not have any >>>> negative impact. >>>> >>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I >>>> can follow up with a similar optimization for the viewExists method in >>>> the Hive catalog. >>>> >>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>> >>>> [2]: >>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>> >>>> >>>> >>>> Best regards, >>>> Steve Zhang >>>> >>>> >>>> >>>>