Hello Team.
I am delighted that the Iceberg community has brought up this matter. We have always believed that file-system-based operability for Iceberg tables is a very valuable feature. At the very least, we should allow users to read Iceberg tables solely through the file system.

Apache Paimon, a competitor to Iceberg, has adopted a similar approach. It uses a set of file-system-based operations to manage its catalog and relies on them to achieve interoperability across multiple engines/boundaries.

Furthermore, in practice, we have also explored solutions for catalog management based on S3/DFS/local file systems. We use only a limited set of list and append operations within the file system to complete catalog management, eliminating all dependencies on operations such as rename that do not behave consistently across file systems. Through this design, we have achieved reliable catalog management on object storage such as S3.

If possible, after refining our prototype, we would like to contribute it to Iceberg.

Thanks,
Lisoda

On 2024-11-14 02:24:50, "Marc Cenac" <marc.ce...@datadoghq.com.INVALID> wrote:

Thanks for the proposal Ashvin! I see value in adding this to support the use case of allowing read-only access from Snowflake. Currently we push updates with an ALTER TABLE command to synchronize our internally hosted catalog with Snowflake, so a version-hint file would potentially eliminate this need.

One question I have is "how could we prevent the version-hint file from being removed during the delete orphan files procedure?" If version-hint is an optional file that is not tracked in the table's metadata, it seems this file could be removed during table maintenance.

On Mon, Nov 11, 2024 at 2:03 PM Ashvin A <ash...@apache.org> wrote:

Hello Community,

We would like to share a proposal to standardize a file system based method to identify Iceberg tables’ current snapshot.
Proposal doc: Adding a File System based Consistent Method to Identify Iceberg Tables’ Current Snapshot

The proposal aims to enhance the interoperability and self-sufficiency of Iceberg tables by replicating the snapshot's metadata file name (version-hint) from the catalog to the file system. This will make the table representation on the file system complete and eliminate the need for catalog dependency in certain read-only scenarios.

Use Case: Microsoft Fabric now supports Iceberg tables in OneLake, allowing users to leverage Iceberg tables in addition to Delta Lake tables with Microsoft Fabric’s compute engines. Having a file system based integration reduces the number of components required in the read query execution path, especially when the catalog is inaccessible or during pre-production scenarios.

Please review the proposal document and share your suggestions in the comments. We look forward to discussing this further.

Best,
Ashvin
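
For illustration only, the catalog-free read path discussed in this thread might look roughly like the Python sketch below. It is not taken from the proposal document. The hint file name `version-hint.text`, its location under the table's `metadata/` directory, and the assumption that it contains the current metadata file's name are all hypothetical choices made for this example; the function name `read_current_snapshot_id` is likewise invented here. A local directory stands in for object storage.

```python
import json
from pathlib import Path
from typing import Optional


def read_current_snapshot_id(table_location: str) -> Optional[int]:
    """Resolve a table's current snapshot id using only file reads.

    Assumptions (hypothetical, for illustration): the hint file is
    <table>/metadata/version-hint.text and holds the file name of the
    current table metadata JSON, as described at a high level in the thread.
    """
    metadata_dir = Path(table_location) / "metadata"
    hint_path = metadata_dir / "version-hint.text"

    # Step 1: read the hint to learn which metadata file is current.
    metadata_file_name = hint_path.read_text(encoding="utf-8").strip()

    # Step 2: load that metadata file; no catalog call is involved.
    metadata_path = metadata_dir / metadata_file_name
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    # "current-snapshot-id" is a field of Iceberg table metadata;
    # it can be absent or null for a table without snapshots.
    return metadata.get("current-snapshot-id")
```

A reader built this way depends on the hint file being present and up to date, which is why the question above about orphan file cleanup matters: if maintenance removes the hint, step 1 fails even though the table itself is intact.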