Re: Querying older versions of an Iceberg table

russell . spitzer Sun, 16 May 2021 08:04:25 -0700

In a real system each file would have a universally unique identifier. When 
Iceberg does a delete it doesn't actually remove the file; it creates a new 
metadata file which no longer includes that file. When you access the table 
as of time one, you are just reading the first metadata file and not the new 
metadata file, which is missing the entry for the deleted file.

The only way to end up in the scenario you describe is if you were manually 
deleting files and adding files using the Iceberg internal API and not 
something like Spark or Flink.

What actually happens is something like:
T1 metadata says f1-uuid exists

The data is deleted
T2 metadata no longer lists f1-uuid

New data is written
T3 metadata says f3-uuid now exists
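
To make that timeline concrete, here is a small Python toy model. It is not 
the Iceberg API; `commit`, `read`, and the two dicts are made up for 
illustration, and a snapshot is reduced to an immutable set of file names.

```python
# Toy model, not the Iceberg API: a snapshot is an immutable set of file names.
snapshots = {}   # snapshot id -> frozenset of data file names
storage = {}     # file name -> contents (stands in for the filesystem)

def commit(snapshot_id, files):
    """Record a new metadata version listing exactly these files."""
    snapshots[snapshot_id] = frozenset(files)

def read(snapshot_id):
    """Time travel: resolve files through that snapshot's own metadata."""
    return sorted(storage[name] for name in snapshots[snapshot_id])

# T1: metadata says f1-uuid exists.
storage["f1-uuid.parquet"] = "rows written at T1"
commit("T1", {"f1-uuid.parquet"})

# T2: the delete writes new metadata without f1; the file stays in storage.
commit("T2", set())

# T3: new data gets a new unique name, so it cannot collide with f1.
storage["f3-uuid.parquet"] = "rows written at T3"
commit("T3", {"f3-uuid.parquet"})
```

Reading T1 in this model still returns the T1 rows after T2 and T3 have been 
committed, because the T1 metadata and the file it points to are both 
untouched.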

Data files are only physically deleted by Iceberg through the expire snapshots 
command. This removes the snapshot metadata as well as any data files which 
are only referred to by the expired snapshots.
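
A rough sketch of that rule, again as a Python toy model and not Iceberg's 
implementation (`expire_snapshots` here is a hypothetical helper, and the file 
names are invented): a file is physically removed only when no retained 
snapshot still references it.

```python
# Toy model, not the Iceberg API: snapshots map an id to the set of files it references.
def expire_snapshots(snapshots, storage, expired_ids):
    """Return (retained snapshots, remaining storage, physically deleted files)."""
    retained = {sid: files for sid, files in snapshots.items() if sid not in expired_ids}
    # Files still reachable from any retained snapshot must be kept.
    still_referenced = set().union(*retained.values())
    # Files referenced by the expired snapshots are candidates for deletion.
    candidates = set().union(*(snapshots[sid] for sid in expired_ids if sid in snapshots))
    deleted = candidates - still_referenced
    remaining = {name: data for name, data in storage.items() if name not in deleted}
    return retained, remaining, deleted

snapshots = {
    "T1": {"f1-uuid.parquet", "f2-uuid.parquet"},
    "T2": {"f2-uuid.parquet"},                      # f1 removed from the table
    "T3": {"f2-uuid.parquet", "f3-uuid.parquet"},   # f3 added
}
storage = {"f1-uuid.parquet": "a", "f2-uuid.parquet": "b", "f3-uuid.parquet": "c"}

retained, remaining, deleted = expire_snapshots(snapshots, storage, {"T1", "T2"})
```

In this example only f1 is deleted: f2 was also referenced by the expired 
snapshots, but T3 still needs it.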

If you are using the internal API (org.apache.iceberg.Table) then it is your 
responsibility to not perform operations or delete files that would violate the 
uniqueness of each snapshot. In this case you would similarly solve the problem 
by just not physically deleting the file when you remove it from the table. In 
any case, using a unique name every time you add data is a good safety measure.
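
One simple way to get unique names, sketched in Python (the helper name and 
the naming scheme are assumptions for illustration, not what Iceberg's writers 
do internally):

```python
import uuid

def unique_data_file_name(prefix="data", extension="parquet"):
    # uuid4 is random, so a file that is re-added never reuses an old path.
    return f"{prefix}-{uuid.uuid4()}.{extension}"

first = unique_data_file_name("f1")
second = unique_data_file_name("f1")
```

With names like these, recreating "f1" at t3 produces a different path from the 
one the t1 snapshot points to, so an old snapshot can never pick up new data by 
accident.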

> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]> wrote:
> 
> 
> Hi,
> 
> I would like to understand if Iceberg supports the following scenario:
> At time t1, there's a table with a file f1.parquet
> At time t2, f1.parquet is removed from the table. f1.parquet is also deleted 
> from the filesystem
> Querying table@t1 results in errors since f1.parquet is no longer available 
> in the filesystem
> At time t3, f1.parquet is recreated and added back to the table
> Querying table@t1 now results in potentially incorrect results since 
> f1.parquet is now present in the filesystem
> Should there be a version identifier for each data-file in the manifest file 
> to handle such scenarios?
> 
> Thanks
> Vivek
>
