Re:Re: [Proposal] Replicating version-hint onto the file system

lisoda Thu, 14 Nov 2024 01:50:07 -0800

Hello Team.


I am delighted that the Iceberg community has brought up this matter. We have 
always believed that being able to operate on Iceberg tables through the file 
system alone is a very valuable feature. At the very least, users should be able 
to read Iceberg tables solely through the file system. Apache Paimon, a 
competitor to Iceberg, has adopted a similar approach: it manages its catalog 
with a set of file system-based operations and uses them to achieve 
interoperability across multiple engines/boundaries. In practice, we have also 
explored catalog management on S3, DFS, and local file systems. Our design uses 
only a limited set of list and append operations, eliminating all dependencies 
on operations such as rename, whose consistency guarantees differ from one file 
system to another. With this design, we have achieved reliable catalog 
management on object storage such as S3. If possible, after refining our 
prototype, we would like to contribute it to Iceberg.
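
To make the idea concrete, here is a minimal sketch of a commit that relies only 
on listing and exclusive file creation, with no rename. It is not our prototype: 
the function names, the `v<N>.metadata.json` naming, and the retry count are 
assumptions chosen for illustration, and it uses a local directory in place of 
an object store.

```python
import os
import re

# Assumed naming scheme for committed versions (illustrative only).
_VERSION_RE = re.compile(r"^v(\d+)\.metadata\.json$")


def latest_version(catalog_dir):
    """Return the highest committed version found by listing, or 0 if none."""
    versions = [
        int(m.group(1))
        for name in os.listdir(catalog_dir)
        if (m := _VERSION_RE.match(name))
    ]
    return max(versions, default=0)


def commit(catalog_dir, payload, max_retries=5):
    """Publish payload as the next version via exclusive create, never rename."""
    for _ in range(max_retries):
        next_version = latest_version(catalog_dir) + 1
        path = os.path.join(catalog_dir, f"v{next_version}.metadata.json")
        try:
            # Mode "x" fails if the file already exists, so two writers
            # cannot both claim the same version number.
            with open(path, "x", encoding="utf-8") as f:
                f.write(payload)
            return next_version
        except FileExistsError:
            continue  # another writer won this version; list again and retry
    raise RuntimeError("commit failed after retries")
```

A reader only needs `latest_version` to find the current state. On a local file 
system a reader could observe a file that is still being written, so a real 
implementation would need a store where object creation is atomic.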


Thanks.
Lisoda.


On 2024-11-14 02:24:50, "Marc Cenac" <marc.ce...@datadoghq.com.INVALID> wrote:

Thanks for the proposal Ashvin!  


I see value in adding this to support the use case of allowing read-only access 
from Snowflake. Currently we push updates with an ALTER TABLE command to 
synchronize our internally-hosted catalog with Snowflake, so a version-hint file 
would potentially eliminate this need.

One question I have: how could we prevent the version-hint file from being 
removed during the delete orphan files procedure? If version-hint is an 
optional file that is not tracked in the table's metadata, it seems this file 
could be removed during table maintenance.


On Mon, Nov 11, 2024 at 2:03 PM Ashvin A <ash...@apache.org> wrote:

Hello Community,

We would like to share a proposal to standardize a file system based method to 
identify Iceberg tables’ current snapshot.

Proposal doc: Adding a File System based Consistent Method to Identify Iceberg 
Tables’ Current Snapshot

The proposal aims to enhance the interoperability and self-sufficiency of 
Iceberg tables by replicating the snapshot's metadata file name (version-hint) 
from the catalog to the file system. This will make the table representation on 
the file system complete and eliminate the need for catalog dependency in 
certain read-only scenarios.
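
As a rough illustration of the read path this enables, the sketch below resolves 
a table's current metadata file from the file system alone. The hint file name 
`version-hint.text`, its location under `metadata/`, and both function names are 
assumptions made for this example; the proposal document defines the actual 
layout and write protocol.

```python
import os

HINT_NAME = "version-hint.text"  # assumed hint file name, for illustration


def write_version_hint(table_root, metadata_file_name):
    """Replicate the current metadata file name from the catalog to the hint."""
    hint_path = os.path.join(table_root, "metadata", HINT_NAME)
    with open(hint_path, "w", encoding="utf-8") as f:
        f.write(metadata_file_name)


def resolve_current_metadata(table_root):
    """Return the path of the metadata file named by the hint, without a catalog."""
    metadata_dir = os.path.join(table_root, "metadata")
    with open(os.path.join(metadata_dir, HINT_NAME), encoding="utf-8") as f:
        name = f.read().strip()
    path = os.path.join(metadata_dir, name)
    if not os.path.isfile(path):
        # A stale or corrupt hint should fail loudly rather than be guessed at.
        raise FileNotFoundError(f"hint points to missing metadata file: {path}")
    return path
```

A read-only engine would call `resolve_current_metadata` and then load the 
returned metadata file as usual; only the catalog-side writer would call 
`write_version_hint` after a successful commit.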

Use Case: Microsoft Fabric now supports Iceberg tables in OneLake, allowing 
users to leverage Iceberg tables in addition to Delta Lake tables with 
Microsoft Fabric’s compute engines. Having a file system based integration 
reduces the number of components required in the read query execution path, 
especially when the catalog is inaccessible or during pre-production scenarios.

Please review the proposal document and share your suggestions in the comments. 
We look forward to discussing this further. 

Best,
Ashvin
