I actually think there is an argument against returning an error after time t3. Maybe the user does want to change a row in a file directly and replace the file, bypassing the Iceberg API to get an updated result quickly. In that case, failing the query after t3 would block that use case. The statistics in the manifest might then be wrong, but one could further argue that the user can directly modify the statistics and replace files all the way up to the snapshot to make sure everything continues to work.
In general, if a user decides to bypass the contract set by Iceberg, I believe we should not try to predict that behavior and compensate for it in the system. Users can bypass the contract in many different ways, and accommodating them opens the door to many awkward use cases and, in the end, breaks fundamental assumptions.

In the case you described, I think the existing Iceberg behavior makes total sense. If you would like to achieve what you described, you can potentially update your FileIO and leverage the versioning feature of the underlying storage to make sure that an uploaded file never has the same identifier, so that users cannot replace a file at t3. For example, if you are running on S3, you can enable S3 versioning and extend S3FileIO so that each file path is not just the S3 path, but the S3 path + version.

But this is just what I think, let's see how others reply.

-Jack

On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]> wrote:

> From an Iceberg perspective, I understand what you are saying.
>
> A lot of our customers add/remove files to the table using scripts. The
> typical workflow would be:
> - Create Parquet files using other tools
> - Add these files to the Iceberg table
>
> The same applies to removing Parquet files from the table. I understand that
> Iceberg doesn't delete the data file until all snapshots that refer to that
> data file expire. However, the customer can delete the file directly - they
> might understand that a query on a snapshot will fail.
>
> I am concerned that an unintentional mistake in updating the Iceberg table
> results in incorrect results while querying an Iceberg snapshot. It is ok
> to return an error when a file referred to by a snapshot does not exist.
>
> This issue can be addressed by adding a version identifier (e.g. mtime) in
> the DataFile object and including this information in the manifest file.
> This ensures that snapshot reads are correct even when users make mistakes
> while adding/removing files to the table.
>
> We can work on this, if there is sufficient interest.
>
> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>
>> In the real system each file would have a universally unique identifier.
>> When Iceberg does a delete it doesn't actually remove the file; it creates a
>> new metadata file which no longer includes that file. When you attempt to
>> access the table at time one, you are actually just reading the first
>> metadata file rather than the new metadata file, which is missing the entry
>> for the deleted file.
>>
>> The only way to end up in the scenario you describe is if you were
>> manually deleting files and adding files using the Iceberg internal API and
>> not something like Spark or Flink.
>>
>> What actually happens is something like:
>> T1 metadata says f1_uuid exists
>>
>> The data is deleted
>> T2 metadata no longer lists f1
>>
>> New data is written
>> T3 metadata says f3_uuid now exists
>>
>> Data files are only physically deleted by Iceberg through the expire
>> snapshots command. This removes the snapshot metadata as well as any data
>> files which are only referred to by the snapshots that are expired.
>>
>> If you are using the internal API (org.apache.iceberg.Table) then it is
>> your responsibility to not perform operations or delete files that would
>> violate the uniqueness of each snapshot. In this case you would similarly
>> solve the problem by just not physically deleting the file when you remove
>> it, although having unique names every time you add data is usually a good
>> safety measure.
>>
>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> I would like to understand if Iceberg supports the following scenario:
>>
>>    - At time t1, there's a table with a file f1.parquet
>>    - At time t2, f1.parquet is removed from the table. f1.parquet is
>>    also deleted from the filesystem
>>    - Querying table@t1 results in errors since f1.parquet is no longer
>>    available in the filesystem
>>    - At time t3, f1.parquet is recreated and added back to the table
>>    - Querying table@t1 now results in potentially incorrect results
>>    since f1.parquet is now present in the filesystem
>>
>> Should there be a version identifier for each data-file in the manifest
>> file to handle such scenarios?
>>
>> Thanks
>> Vivek
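
To make the t1/t2/t3 scenario from the original question concrete, here is a minimal toy sketch in Python. It is not Iceberg code and does not use any Iceberg API: `ToyStore`, `DataFileRef`, `read_by_path`, `read_checked` and `StaleFileError` are all hypothetical names made up for illustration, and the integer version is only a stand-in for something like an mtime or an object-store version ID. It compares a path-only lookup with a lookup that also checks a version recorded alongside the path.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class StaleFileError(Exception):
    """The file at a path is not the version the snapshot recorded."""


@dataclass(frozen=True)
class DataFileRef:
    """Hypothetical manifest entry: a path plus a version identifier."""
    path: str
    version: int


class ToyStore:
    """In-memory stand-in for a filesystem; not a real FileIO."""

    def __init__(self) -> None:
        self._files: Dict[str, Tuple[int, bytes]] = {}
        self._next_version = 1

    def put(self, path: str, data: bytes) -> DataFileRef:
        # Every write gets a fresh version, standing in for an mtime
        # or an object-store version ID.
        ref = DataFileRef(path, self._next_version)
        self._files[path] = (ref.version, data)
        self._next_version += 1
        return ref

    def delete(self, path: str) -> None:
        del self._files[path]

    def get(self, path: str) -> Tuple[int, bytes]:
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]


def read_by_path(store: ToyStore, ref: DataFileRef) -> bytes:
    """Path-only lookup: returns whatever currently lives at the path."""
    _, data = store.get(ref.path)
    return data


def read_checked(store: ToyStore, ref: DataFileRef) -> bytes:
    """Path + version lookup: refuses data written after the snapshot."""
    version, data = store.get(ref.path)
    if version != ref.version:
        raise StaleFileError(
            f"{ref.path}: expected v{ref.version}, found v{version}"
        )
    return data


def run_scenario() -> Dict[str, str]:
    store = ToyStore()
    outcome: Dict[str, str] = {}

    # t1: f1.parquet is written and referenced by the snapshot.
    snapshot_t1 = store.put("f1.parquet", b"rows-at-t1")

    # t2: the file is removed from the table and deleted from storage.
    store.delete("f1.parquet")
    try:
        read_checked(store, snapshot_t1)
    except FileNotFoundError:
        outcome["t2"] = "error: file missing"

    # t3: a different file is created under the same path.
    store.put("f1.parquet", b"rows-at-t3")
    outcome["t3_path_only"] = read_by_path(store, snapshot_t1).decode()
    try:
        read_checked(store, snapshot_t1)
    except StaleFileError:
        outcome["t3_checked"] = "error: stale version"
    return outcome


if __name__ == "__main__":
    for step, result in run_scenario().items():
        print(step, "->", result)
```

In this toy model the path-only read at t3 silently returns the new rows for table@t1, while the version-checked read fails. With a store that keeps old versions addressable, such as an S3 bucket with versioning enabled, a path + version reference could presumably go further and fetch the original bytes instead of failing, which is roughly the idea behind the FileIO suggestion above.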
