Yea, I think that part is definitely kept. Thanks Szehon
On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <rdb...@gmail.com> wrote: > I'd support changing the behavior if we still have a way to match the > intent, which is to return true if the table exists in Hive and is an > Iceberg table. > > On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com> > wrote: > >> Hm I think the thread got a bit sidetracked by the other question. >> >> The initial proposal by Steve is a performance improvement for >> HiveCatalog's tableExists(). Currently it loads both Hive and Iceberg >> table metadata, and if successful returns true. The proposal is to load >> from Hive only, and return true if Hive metadata identifies that an Iceberg >> table exists with this name. >> >> Checking corruption of Iceberg's table metadata.json is a side-effect of >> the current behavior, but would not anymore with the proposed change. >> That's the question of the original thread, and so far there's agreement >> that it is not necessarily part of this scope of HiveCatalog's >> tableExists(). >> >> At least this is my understanding. >> Thanks, >> Szehon >> >> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com> >> wrote: >> >>> What kind of corruption are you referring to? I would expect corruption >>> to result in an exception when loading the table, but that the table should >>> still exist. The problem is likely that we determine if a table exists by >>> attempting to load it. We could fix that by not attempting to load the >>> table. I think that's a reasonable solution. >>> >>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com> >>> wrote: >>> >>>> The current behavior's intent is not to check whether the metadata is >>>>> valid, it is to detect whether the table is an Iceberg table. >>>> >>>> >>>> Is there a way to detect this from HiveCatalog without loading the >>>> table? >>>> >>>> >>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com> >>>> wrote: >>>> >>>>> I think we have an agreement, not to change the behavior wrt existing >>>>> non-Iceberg tables, and throw an exception. >>>>> >>>>> Are we also in agreement with the original proposal to return true >>>>> when the table exists but the metadata is somehow corrupted? Note: this is >>>>> the proposed change of behavior why the thread was originally started. >>>>> >>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote: >>>>> >>>>>> I'd argue against changing this. The current behavior's intent is not >>>>>> to check whether the metadata is valid, it is to detect whether the table >>>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that >>>>>> behavior >>>>>> would be surprising, especially if we started throwing exceptions. >>>>>> >>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> > Should add, my personal preference is probably not to change the >>>>>>> existing behavior for this part >>>>>>> >>>>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>>>> implementation has this problem too. >>>>>>> It would be good to have a test case specifically for this edge case >>>>>>> and maybe call this out in the documentation. >>>>>>> >>>>>>> Thanks, >>>>>>> Kevin Liu >>>>>>> >>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Should add, my personal preference is probably not to change the >>>>>>>> existing behavior for this part (false, if exists a Hive table with >>>>>>>> same >>>>>>>> name) at the moment, just adding another possibility for consideration. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion. I >>>>>>>>> guess a third option instead of returning true/false in this case, is >>>>>>>>> to >>>>>>>>> change it to throw an NoSuchIcebergTableException if its a non-Iceberg >>>>>>>>> table, which I think is actually what this pr does? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hey, >>>>>>>>>> >>>>>>>>>> I think what Kevin says makes sense. However, it would then >>>>>>>>>> confuse the opposite use case of this function. Let's assume that we >>>>>>>>>> change >>>>>>>>>> the implementation of tableExists() to not load the table internally: >>>>>>>>>> >>>>>>>>>> if (tableExists(table_name)) { >>>>>>>>>> table = loadTable(table_name); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Here, you find that the table exists but when you try to load it >>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any >>>>>>>>>> of these >>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the >>>>>>>>>> Iceberg table format should know about the existence of tables in >>>>>>>>>> other >>>>>>>>>> formats. >>>>>>>>>> >>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>>>>> HMS >>>>>>>>>> >>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an >>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not >>>>>>>>>> sure if >>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to >>>>>>>>>> show >>>>>>>>>> that there is another side of the coin :) >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Gabor >>>>>>>>>> >>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Steve, >>>>>>>>>>> >>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog, >>>>>>>>>>> checking the HMS entry should be sufficient. >>>>>>>>>>> >>>>>>>>>>> I do have a question about usage, though. Typically, I would use >>>>>>>>>>> `tableExists` like this: >>>>>>>>>>> >>>>>>>>>>> ``` >>>>>>>>>>> if (!tableExists(table_name)) { >>>>>>>>>>> table = createTable(table_name); >>>>>>>>>>> } >>>>>>>>>>> ``` >>>>>>>>>>> What happens when a Hive table with the same name already exists >>>>>>>>>>> in the catalog? In the current implementation, `tableExists` would >>>>>>>>>>> return >>>>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a >>>>>>>>>>> `NoSuchTableException`. >>>>>>>>>>> This would cause the code above to attempt to create the table, >>>>>>>>>>> only to fail since the name already exists in the HMS. >>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in >>>>>>>>>>> the HMS, perhaps it should return true even when a Hive table with >>>>>>>>>>> the same >>>>>>>>>>> name exists. >>>>>>>>>>> >>>>>>>>>>> I’d love to hear your thoughts on this. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Kevin Liu >>>>>>>>>>> >>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho < >>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> It's a good performance find and improvement. Left some >>>>>>>>>>>> comment on the PR. >>>>>>>>>>>> >>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm >>>>>>>>>>>> supportive of it. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Szehon >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Iceberger, >>>>>>>>>>>>> >>>>>>>>>>>>> I have a proposal to simplify the tableExists API in the >>>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to >>>>>>>>>>>>> hear your >>>>>>>>>>>>> thoughts. >>>>>>>>>>>>> >>>>>>>>>>>>> Currently, in our catalog interface[1], the tableExists method >>>>>>>>>>>>> is implemented as a default API by invoking the loadTable method. >>>>>>>>>>>>> It returns true if the table can be loaded without >>>>>>>>>>>>> exceptions. This behavior implies two checks: >>>>>>>>>>>>> >>>>>>>>>>>>> 1. The table entry exists in the catalog. >>>>>>>>>>>>> 2. The latest metadata.json for the table is not corrupted. >>>>>>>>>>>>> >>>>>>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. This >>>>>>>>>>>>> separates >>>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata >>>>>>>>>>>>> not >>>>>>>>>>>>> corrupted). Such a change could improve the performance of >>>>>>>>>>>>> existence >>>>>>>>>>>>> checks, especially for RESTcatalog where table existence is >>>>>>>>>>>>> abstracted as >>>>>>>>>>>>> an HTTP HEAD request [2]. >>>>>>>>>>>>> >>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in >>>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not >>>>>>>>>>>>> have any >>>>>>>>>>>>> negative impact. >>>>>>>>>>>>> >>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>>>>>> >>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>>>>>> >>>>>>>>>>>>> [2]: >>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> Steve Zhang >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>