This is a useful addition. I believe it is important to list the requirements
for such an action in greater detail, especially what is in scope and what is
not. Some open questions that could be added to the requirements /
non-requirements section:
1. Should the copied table be registered in the same catalog as the source
table, or in a different catalog for the destination table?
* This has implications for the table identifier and for how the metadata
is copied.
2. Are we shooting for perfect query reproducibility for time travel queries
across the source and copied table? I.e., will the snapshot chain of the
source table be maintained on the copied table?
* The spec talks about rebuilding metadata, but it would be clearer if it
stated whether the entire snapshot chain is maintained, or whether metadata is
rebuilt in a way that only the data in the current snapshot matches between
source and destination.
3. Is this a one-time copy action, or is it something we can run on a
schedule, i.e., as new data is written to the source table, incremental deltas
(appends, updates, deletes) will be copied?
* The latter has implications to consider, as various maintenance jobs
running on the source and destination can cause the snapshot chains to
diverge.
At LinkedIn, we ran into the absolute vs. relative path issue when designing
snapshot replication for Iceberg tables. The way we approached it is to store
the absolute path of the file in the metadata, without the scheme and cluster.
Since we use HadoopFileIO, the scheme and cluster are derived from the Hadoop
conf. For example, if the file path is hdfs://<cluster>/data/openhouse/db/tb_uuid,
what is stored in Iceberg metadata is /data/openhouse/db/tb_uuid, and
hdfs://<cluster> comes from the Hadoop conf.
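To make the scheme-stripping idea above concrete, here is a minimal sketch of the two directions of the conversion. The helper names are hypothetical, not LinkedIn's actual implementation, and the real logic lives inside a HadoopFileIO in Java; this just illustrates the split between what is stored in metadata and what comes from the conf.

```python
from urllib.parse import urlparse

def to_relative(absolute_uri: str) -> str:
    """Strip the scheme and cluster (authority) before writing to metadata,
    e.g. hdfs://clusterA/data/openhouse/db/tb_uuid -> /data/openhouse/db/tb_uuid."""
    return urlparse(absolute_uri).path

def to_absolute(stored_path: str, fs_default: str) -> str:
    """Re-join with the scheme and cluster taken from the Hadoop conf
    (e.g. fs.defaultFS) at read time."""
    return fs_default.rstrip("/") + stored_path

stored = to_relative("hdfs://clusterA/data/openhouse/db/tb_uuid")
resolved = to_absolute(stored, "hdfs://clusterA")
```

Because only the path after the authority is persisted, pointing the same metadata at a different cluster is just a conf change.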
Has the community considered an approach where the scheme and cluster are
minted by the catalog, to be used in the respective FileIO implementation for
blob stores? For example, if we had a bucket foo on us-east and a bucket bar
on us-west, the catalog running on us-east would mint s3://foo, the catalog
running on us-west would mint s3://bar, and the S3FileIO would join that with
the rest of the relative path to the object. This would allow us to capture
the path relative to s3://<bucket-name> in the Iceberg metadata.
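The catalog-minted prefix idea can be sketched as a simple join; the function name is hypothetical and this is not an existing S3FileIO API, just an illustration of how one metadata entry would resolve per region.

```python
def resolve_path(catalog_prefix: str, stored_path: str) -> str:
    """Join the catalog-minted prefix (e.g. s3://foo on us-east,
    s3://bar on us-west) with the bucket-relative path stored in
    Iceberg metadata."""
    return catalog_prefix.rstrip("/") + "/" + stored_path.lstrip("/")

# The same metadata entry resolves to a different bucket in each region:
east = resolve_path("s3://foo", "db/table/data/00000.parquet")
west = resolve_path("s3://bar", "db/table/data/00000.parquet")
```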
Thanks,
-sumedh
From: Pucheng Yang <[email protected]>
Date: Thursday, July 11, 2024 at 8:15 AM
To: [email protected] <[email protected]>
Subject: Re: Spark: Copy Table Action
Hi Yufei, I was wondering if we also want to support the use case of moving
tables in this proposal. For example, users might have various reasons to
change the table location; however, there is no good way to move the original
data files to the new location unless we rewrite the data files, which seems
like a misuse of that functionality.
On Wed, Jul 10, 2024 at 9:37 AM Ajantha Bhat
<[email protected]<mailto:[email protected]>> wrote:
For RemoveExpiredFiles, I'm admittedly a bit skeptical that it's required,
since orphan file removal should be able to clean up the files in the copied table.
Are we able to elaborate why there's a concern with removing snapshots on the
copied table and subsequently relying on orphan file removal on the copied
table to remove the actual files? Is it around listing?
I have the same concern as Amogh. I already mentioned the same thing in the PR
yesterday<https://github.com/apache/iceberg/pull/10643#discussion_r1669739401>.
I suggested renaming it to RemoveTableCopyOrphanFiles. Thinking more on this
today, I think we should atomically (implicitly) handle cleaning up orphan
files as part of the copy table action instead of in a separate action.
Also, very happy to see the progress on this one. This will help users move
data from one location to another seamlessly.
- Ajantha
On Wed, Jul 10, 2024 at 7:35 AM Amogh Jahagirdar
<[email protected]<mailto:[email protected]>> wrote:
Thanks Yufei!
+1 on having a copy table action; I think that's pretty valuable. I have some
ideas on interfaces based on previous work I've done for regional/multi-cloud
replication of Iceberg tables. The absolute vs. relative path discussion is
interesting; I have some questions on how relative pathing would look, but
I'll wait for Anurag's input.
On CheckSnapshotIntegrity, I think I'd probably advocate for a more general
"Repair Metadata" procedure. Currently, it looks like CheckSnapshotIntegrity
just tells a user which files are missing in its output. I think we could go a
step further and attempt to handle cases where a manifest entry refers to a
file which no longer exists. We could attempt a recovery of that file if the
FileIO implementation supports it, via some sort of SupportsRecovery mixin.
There's also another corruption case where duplicate file entries end up in
manifests; we could define an approach for reconciling those and write out new
manifests.
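To sketch what the SupportsRecovery idea might look like: everything below is hypothetical (SupportsRecovery is not an existing Iceberg interface), and the real version would be a Java mixin on FileIO; this Python sketch just shows the intended control flow of a repair pass.

```python
from abc import ABC, abstractmethod

class FileIO(ABC):
    @abstractmethod
    def exists(self, path: str) -> bool: ...

class SupportsRecovery(ABC):
    """Hypothetical mixin for FileIO implementations that can restore a
    deleted file, e.g. via object-store versioning or soft delete."""
    @abstractmethod
    def recover(self, path: str) -> bool:
        """Attempt to restore the file; return True on success."""

def repair_missing_entries(io: FileIO, manifest_paths: list) -> list:
    """Return the paths that are still missing after attempting recovery."""
    missing = []
    for path in manifest_paths:
        if io.exists(path):
            continue
        if isinstance(io, SupportsRecovery) and io.recover(path):
            continue  # file was restored by the FileIO
        missing.append(path)  # unrecoverable: surface to the user
    return missing

class _InMemoryIO(FileIO, SupportsRecovery):
    """Toy FileIO for illustration only."""
    def __init__(self, live, recoverable):
        self.live, self.recoverable = set(live), set(recoverable)
    def exists(self, path):
        return path in self.live
    def recover(self, path):
        if path in self.recoverable:
            self.live.add(path)
            return True
        return False

io = _InMemoryIO(live={"a.parquet"}, recoverable={"b.parquet"})
still_missing = repair_missing_entries(io, ["a.parquet", "b.parquet", "c.parquet"])
```

A dry-run mode, as suggested below for verifying a copied table, would simply skip the recover() call and report everything that exists() fails on.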
There have actually been two attempts at this, one from Szehon quite a while
back (https://github.com/apache/iceberg/pull/2608) and another more recently
from Matt (https://github.com/apache/iceberg/pull/10445). Perhaps we could
review both of these and figure out a path forward?
For just verifying the integrity of the copy table, we could have a dry run
option for the repair metadata operation which would output any missing files,
or manifests with duplicates without performing any recovery/fixing up.
For RemoveExpiredFiles, I'm admittedly a bit skeptical that it's required,
since orphan file removal should be able to clean up the files in the copied table.
Are we able to elaborate why there's a concern with removing snapshots on the
copied table and subsequently relying on orphan file removal on the copied
table to remove the actual files? Is it around listing?
Overall this is great to see.
Thanks,
Amogh Jahagirdar
On Tue, Jul 9, 2024 at 10:59 AM Anurag Mantripragada
<[email protected]> wrote:
Agreed with Peter. I will bring relative paths changes up in the next community
sync. I will help drive this.
~ Anurag Mantripragada
On Jul 8, 2024, at 10:50 PM, Péter Váry
<[email protected]<mailto:[email protected]>> wrote:
I think in most cases the copy table action doesn't require a query engine to
read and generate the new metadata files. This means that it would be nice to
provide a pure Java implementation in core, which could be extended/reused by
different engines, like Spark, to execute it in a distributed manner when
distributed execution is needed.
About the absolute vs. relative path debate:
- I have seen the relative path requirement come up multiple times in the
past. It seems like a feature requested by multiple users, so I think it would
be best to discuss it in a different thread. The Copy Table Action might be
used to move absolute-path tables to relative-path tables when migration is
needed.
On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada
<[email protected]> wrote:
Hi Yufei.
Thanks for the proposal. While the actions are great, they still need to do a
lot of work that could be reduced if we had the relative path changes. I still
support adding these actions, as moving data was out of scope for the relative
path design, and we can use them as helpers once the spec change is done.
Anurag Mantripragada
On Jul 8, 2024, at 10:55 AM, Pucheng Yang
<[email protected]<mailto:[email protected]>> wrote:
Thanks for picking this up; I think this is a very valuable addition.
On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu
<[email protected]<mailto:[email protected]>> wrote:
Hi folks,
I'd like to share recent progress on adding actions to copy tables across
different places.
There is a constant need to copy tables across different places for purposes
such as disaster recovery and testing. Due to the absolute file paths in
Iceberg metadata, copying doesn't work automatically. There are three generic
solutions:
1. Rebuild the metadata: This is a proven approach widely used across various
companies.
2. S3 access point: Effective when both the source and target locations are in
S3, but not applicable to other storage systems.
3. Relative path: It requires changes to the table specification.
We focus on the first approach in this thread. While the code was shared two
years ago here<https://github.com/apache/iceberg/pull/4705>, it was never
merged. We picked it up recently. Here are the active PRs related to this
action. We would really appreciate any feedback and review:
* PR to add CopyTable action: https://github.com/apache/iceberg/pull/10024
* PR to add CheckSnapshotIntegrity action:
https://github.com/apache/iceberg/pull/10642
* PR to add RemoveExpiredFiles action:
https://github.com/apache/iceberg/pull/10643
Here is a Google doc with more details to clarify the goals and approach:
https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing
Yufei