What kind of corruption are you referring to? I would expect corruption to
result in an exception when loading the table, but that the table should
still exist. The problem is likely that we determine if a table exists by
attempting to load it. We could fix that by not attempting to load the
table. I think that's a reasonable solution.

On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> The current behavior's intent is not to check whether the metadata is
>> valid, it is to detect whether the table is an Iceberg table.
>
>
> Is there a way to detect this from HiveCatalog without loading the table?
>
>
> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> I think we have an agreement, not to change the behavior wrt existing
>> non-Iceberg tables, and throw an exception.
>>
>> Are we also in agreement with the original proposal to return true when
>> the table exists but the metadata is somehow corrupted? Note: this is the
>> proposed change of behavior why the thread was originally started.
>>
>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>
>>> I'd argue against changing this. The current behavior's intent is not to
>>> check whether the metadata is valid, it is to detect whether the table is
>>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>> would be surprising, especially if we started throwing exceptions.
>>>
>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>
>>>> > Should add, my personal preference is probably not to change the
>>>> existing behavior for this part
>>>>
>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>> implementation has this problem too.
>>>> It would be good to have a test case specifically for this edge case
>>>> and maybe call this out in the documentation.
>>>>
>>>> Thanks,
>>>> Kevin Liu
>>>>
>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Should add, my personal preference is probably not to change the
>>>>> existing behavior for this part (false, if exists a Hive table with same
>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>>>>>> third option instead of returning true/false in this case, is to change 
>>>>>> it
>>>>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, 
>>>>>> which I
>>>>>> think is actually what this pr does?
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> I think what Kevin says makes sense. However, it would then confuse
>>>>>>> the opposite use case of this function. Let's assume that we change the
>>>>>>> implementation of tableExists() to not load the table internally:
>>>>>>>
>>>>>>> if (tableExists(table_name)) {
>>>>>>>     table = loadTable(table_name);
>>>>>>> }
>>>>>>>
>>>>>>> Here, you find that the table exists but when you try to load it it
>>>>>>> fails because it's not an Iceberg table. I don't think that any of 
>>>>>>> these 2
>>>>>>> are intuitive. I think the question here is how much an API of the 
>>>>>>> Iceberg
>>>>>>> table format should know about the existence of tables in other formats.
>>>>>>>
>>>>>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>>>>>
>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not sure 
>>>>>>> if
>>>>>>> any of the 2 approaches are better than the other, I just wanted to show
>>>>>>> that there is another side of the coin :)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Steve,
>>>>>>>>
>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>>
>>>>>>>> I do have a question about usage, though. Typically, I would use `
>>>>>>>> tableExists` like this:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>     table = createTable(table_name);
>>>>>>>> }
>>>>>>>> ```
>>>>>>>> What happens when a Hive table with the same name already exists in
>>>>>>>> the catalog? In the current implementation, `tableExists` would return
>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>>>> `NoSuchTableException`.
>>>>>>>> This would cause the code above to attempt to create the table,
>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>> HMS, perhaps it should return true even when a Hive table with the same
>>>>>>>> name exists.
>>>>>>>>
>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kevin Liu
>>>>>>>>
>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It's a good performance find and improvement.   Left some comment
>>>>>>>>> on the PR.
>>>>>>>>>
>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>>> supportive of it.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>
>>>>>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>>>>>> thoughts.
>>>>>>>>>>
>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>>> It returns true if the table can be loaded without exceptions.
>>>>>>>>>> This behavior implies two checks:
>>>>>>>>>>
>>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>>>
>>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>>>> separates
>>>>>>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>>>>>>> corrupted). Such a change could improve the performance of existence
>>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>>> abstracted as
>>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>>
>>>>>>>>>> I also reviewed the current usage of the tableExists API in the
>>>>>>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>>>>>>> negative impact.
>>>>>>>>>>
>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>
>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>
>>>>>>>>>> [2]:
>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Steve Zhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>

Reply via email to