Yea, I think that part is definitely kept.

Thanks
Szehon

On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I'd support changing the behavior if we still have a way to match the
> intent, which is to return true if the table exists in Hive and is an
> Iceberg table.
>
> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Hm I think the thread got a bit sidetracked by the other question.
>>
>> The initial proposal by Steve is a performance improvement for
>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>> table metadata, and if successful returns true.  The proposal is to load
>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>> table exists with this name.
>>
>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>> the current behavior, but would not anymore with the proposed change.
>> That's the question of the original thread, and so far there's agreement
>> that it is not necessarily part of this scope of HiveCatalog's
>> tableExists().
>>
>> At least this is my understanding.
>> Thanks,
>> Szehon
>>
>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com>
>> wrote:
>>
>>> What kind of corruption are you referring to? I would expect corruption
>>> to result in an exception when loading the table, but that the table should
>>> still exist. The problem is likely that we determine if a table exists by
>>> attempting to load it. We could fix that by not attempting to load the
>>> table. I think that's a reasonable solution.
>>>
>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> The current behavior's intent is not to check whether the metadata is
>>>>> valid, it is to detect whether the table is an Iceberg table.
>>>>
>>>>
>>>> Is there a way to detect this from HiveCatalog without loading the
>>>> table?
>>>>
>>>>
>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think we have an agreement, not to change the behavior wrt existing
>>>>> non-Iceberg tables, and throw an exception.
>>>>>
>>>>> Are we also in agreement with the original proposal to return true
>>>>> when the table exists but the metadata is somehow corrupted? Note: this is
>>>>> the proposed change of behavior why the thread was originally started.
>>>>>
>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>
>>>>>> I'd argue against changing this. The current behavior's intent is not
>>>>>> to check whether the metadata is valid, it is to detect whether the table
>>>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that 
>>>>>> behavior
>>>>>> would be surprising, especially if we started throwing exceptions.
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> > Should add, my personal preference is probably not to change the
>>>>>>> existing behavior for this part
>>>>>>>
>>>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>>>> implementation has this problem too.
>>>>>>> It would be good to have a test case specifically for this edge case
>>>>>>> and maybe call this out in the documentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>>> existing behavior for this part (false, if exists a Hive table with 
>>>>>>>> same
>>>>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I
>>>>>>>>> guess a third option instead of returning true/false in this case, is 
>>>>>>>>> to
>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a non-Iceberg
>>>>>>>>> table, which I think is actually what this pr does?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> I think what Kevin says makes sense. However, it would then
>>>>>>>>>> confuse the opposite use case of this function. Let's assume that we 
>>>>>>>>>> change
>>>>>>>>>> the implementation of tableExists() to not load the table internally:
>>>>>>>>>>
>>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>>     table = loadTable(table_name);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Here, you find that the table exists but when you try to load it
>>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any 
>>>>>>>>>> of these
>>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the
>>>>>>>>>> Iceberg table format should know about the existence of tables in 
>>>>>>>>>> other
>>>>>>>>>> formats.
>>>>>>>>>>
>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>>>>> HMS
>>>>>>>>>>
>>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not 
>>>>>>>>>> sure if
>>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to 
>>>>>>>>>> show
>>>>>>>>>> that there is another side of the coin :)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Gabor
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Steve,
>>>>>>>>>>>
>>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>>>>>
>>>>>>>>>>> I do have a question about usage, though. Typically, I would use
>>>>>>>>>>> `tableExists` like this:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>>>>     table = createTable(table_name);
>>>>>>>>>>> }
>>>>>>>>>>> ```
>>>>>>>>>>> What happens when a Hive table with the same name already exists
>>>>>>>>>>> in the catalog? In the current implementation, `tableExists` would 
>>>>>>>>>>> return
>>>>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>>>>>>> `NoSuchTableException`.
>>>>>>>>>>> This would cause the code above to attempt to create the table,
>>>>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in
>>>>>>>>>>> the HMS, perhaps it should return true even when a Hive table with 
>>>>>>>>>>> the same
>>>>>>>>>>> name exists.
>>>>>>>>>>>
>>>>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <
>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> It's a good performance find and improvement.   Left some
>>>>>>>>>>>> comment on the PR.
>>>>>>>>>>>>
>>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>>>>>> supportive of it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>>>>
>>>>>>>>>>>>>   I have a proposal to simplify the tableExists API in the
>>>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to 
>>>>>>>>>>>>> hear your
>>>>>>>>>>>>> thoughts.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>>>>>> It returns true if the table can be loaded without
>>>>>>>>>>>>> exceptions. This behavior implies two checks:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>>>>>>> separates
>>>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata 
>>>>>>>>>>>>> not
>>>>>>>>>>>>> corrupted). Such a change could improve the performance of 
>>>>>>>>>>>>> existence
>>>>>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>>>>>> abstracted as
>>>>>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in
>>>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not 
>>>>>>>>>>>>> have any
>>>>>>>>>>>>> negative impact.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2]:
>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Steve Zhang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>

Reply via email to