I'd support changing the behavior if we still have a way to match the
intent, which is to return true if the table exists in Hive and is an
Iceberg table.

On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hm I think the thread got a bit sidetracked by the other question.
>
> The initial proposal by Steve is a performance improvement for
> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
> table metadata, and if successful returns true.  The proposal is to load
> from Hive only, and return true if Hive metadata identifies that an Iceberg
> table exists with this name.
>
> Checking corruption of Iceberg's table metadata.json is a side-effect of
> the current behavior, but would not anymore with the proposed change.
> That's the question of the original thread, and so far there's agreement
> that it is not necessarily part of this scope of HiveCatalog's
> tableExists().
>
> At least this is my understanding.
> Thanks,
> Szehon
>
> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com>
> wrote:
>
>> What kind of corruption are you referring to? I would expect corruption
>> to result in an exception when loading the table, but that the table should
>> still exist. The problem is likely that we determine if a table exists by
>> attempting to load it. We could fix that by not attempting to load the
>> table. I think that's a reasonable solution.
>>
>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> The current behavior's intent is not to check whether the metadata is
>>>> valid, it is to detect whether the table is an Iceberg table.
>>>
>>>
>>> Is there a way to detect this from HiveCatalog without loading the
>>> table?
>>>
>>>
>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> I think we have an agreement, not to change the behavior wrt existing
>>>> non-Iceberg tables, and throw an exception.
>>>>
>>>> Are we also in agreement with the original proposal to return true when
>>>> the table exists but the metadata is somehow corrupted? Note: this is the
>>>> proposed change of behavior why the thread was originally started.
>>>>
>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>
>>>>> I'd argue against changing this. The current behavior's intent is not
>>>>> to check whether the metadata is valid, it is to detect whether the table
>>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>>>> would be surprising, especially if we started throwing exceptions.
>>>>>
>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> > Should add, my personal preference is probably not to change the
>>>>>> existing behavior for this part
>>>>>>
>>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>>> implementation has this problem too.
>>>>>> It would be good to have a test case specifically for this edge case
>>>>>> and maybe call this out in the documentation.
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>> existing behavior for this part (false, if exists a Hive table with same
>>>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess
>>>>>>>> a third option instead of returning true/false in this case, is to 
>>>>>>>> change
>>>>>>>> it to throw an NoSuchIcebergTableException if its a non-Iceberg table,
>>>>>>>> which I think is actually what this pr does?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> I think what Kevin says makes sense. However, it would then
>>>>>>>>> confuse the opposite use case of this function. Let's assume that we 
>>>>>>>>> change
>>>>>>>>> the implementation of tableExists() to not load the table internally:
>>>>>>>>>
>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>     table = loadTable(table_name);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Here, you find that the table exists but when you try to load it
>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any of 
>>>>>>>>> these
>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the
>>>>>>>>> Iceberg table format should know about the existence of tables in 
>>>>>>>>> other
>>>>>>>>> formats.
>>>>>>>>>
>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>>>> HMS
>>>>>>>>>
>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not 
>>>>>>>>> sure if
>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to 
>>>>>>>>> show
>>>>>>>>> that there is another side of the coin :)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Steve,
>>>>>>>>>>
>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>>>>
>>>>>>>>>> I do have a question about usage, though. Typically, I would use
>>>>>>>>>> `tableExists` like this:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>>>     table = createTable(table_name);
>>>>>>>>>> }
>>>>>>>>>> ```
>>>>>>>>>> What happens when a Hive table with the same name already exists
>>>>>>>>>> in the catalog? In the current implementation, `tableExists` would 
>>>>>>>>>> return
>>>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>>>>>> `NoSuchTableException`.
>>>>>>>>>> This would cause the code above to attempt to create the table,
>>>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>>>> HMS, perhaps it should return true even when a Hive table with the 
>>>>>>>>>> same
>>>>>>>>>> name exists.
>>>>>>>>>>
>>>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kevin Liu
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <
>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> It's a good performance find and improvement.   Left some
>>>>>>>>>>> comment on the PR.
>>>>>>>>>>>
>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>>>>> supportive of it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>>>
>>>>>>>>>>>>   I have a proposal to simplify the tableExists API in the
>>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to 
>>>>>>>>>>>> hear your
>>>>>>>>>>>> thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>>>>> It returns true if the table can be loaded without exceptions.
>>>>>>>>>>>> This behavior implies two checks:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>>>>>
>>>>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>>>>>> separates
>>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata 
>>>>>>>>>>>> not
>>>>>>>>>>>> corrupted). Such a change could improve the performance of 
>>>>>>>>>>>> existence
>>>>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>>>>> abstracted as
>>>>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>>>>
>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in
>>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not 
>>>>>>>>>>>> have any
>>>>>>>>>>>> negative impact.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>>>
>>>>>>>>>>>> [2]:
>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Steve Zhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>

Reply via email to