dmgcodevil edited a comment on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876127259
Understood. Let's say we have the following snapshots:
snapshot_1 (ts=1) contains files A, B
snapshot_2 (ts=2) contains files C, D
(ts = timestamp)
If I expire snapshot_1, would I still be able to query data from files A and B?
Based on your explanation, I should, because snapshot_2's manifest list still
references A and B. Thus only snapshot_1's metadata can be removed (.metadata.json,
snap-*.avro), but not the data files A, B.
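As a sanity check, a minimal sketch of how one could verify this, assuming a Spark session, an Iceberg table `db.tbl`, and a `snapshot2Id` looked up beforehand (all placeholder names, not from this issue):

```scala
// Hypothetical check: read the table pinned to snapshot_2 after snapshot_1
// was expired. "snapshot-id" is a standard Iceberg Spark read option.
val df = spark.read
  .format("iceberg")
  .option("snapshot-id", snapshot2Id) // id of snapshot_2, from the snapshots metadata table
  .load("db.tbl")
// If snapshot_2 still references A and B, their rows should appear here.
df.show()
```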
What will happen if I expire snapshots with timestamp less than 3? Will Expire
Snapshots delete A, B, C, D?
I.e., if I've made a mistake and somehow specified a very large timestamp, will it
expire all my snapshots and potentially kill all data files? I think that
`RemoveOrphanFiles` would then definitely delete files.
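For reference, a minimal sketch of the expiry call in the same pre-0.12 Actions API used below; `retainLast` is the usual guard against exactly this kind of mistake (the cutoff value is a placeholder):

```scala
import org.apache.iceberg.actions.Actions

// Expire snapshots older than cutoffMillis, but always keep the last 5
// snapshots even if a mistakenly large cutoff would otherwise match them.
Actions.forTable(table)
  .expireSnapshots()
  .expireOlderThan(cutoffMillis) // e.g. System.currentTimeMillis() minus 7 days
  .retainLast(5)
  .execute()
```

With `retainLast`, even an accidentally huge timestamp cannot expire every snapshot, so the current table state stays reachable.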
Let me explain my case and the outcome.
I had a table like the one below:
snapshot_1 A, B (2021-07-05)
snapshot_2 C, D (2021-07-06)
table: A,B,C,D
my data is partitioned by day
2021-07-05 contains: A,B,
2021-07-06 contains: C,D
I wanted to combine files from 2021-07-05
```scala
// Compact data files in the [startDate, endDate) range
// (dates in seconds, multiplied by 1000 for the millisecond filter)
Actions.forTable(table).rewriteDataFiles()
  .filter(Expressions.greaterThanOrEqual(field, startDate * 1000))
  .filter(Expressions.lessThan(field, endDate * 1000))
  .targetSizeInBytes(targetSizeMB * 1024 * 1024)
  .execute()
```
snapshot_1 (ts=1) A, B
snapshot_2 (ts=2) C, D
snapshot_3 (ts=3) F added; A, B deleted
(ts = timestamp)
table: C, D, F
2021-07-05 contains: A, B, F
2021-07-06 contains: C, D
I executed Expire Snapshots with ts < 3.
After this operation, I noticed that some files got deleted from the
`metadata` folder, but A and B were still in the data folder 2021-07-05.
Then I executed `RemoveOrphanFiles` and noticed that a lot of files (about 90%)
were removed from the metadata folder, and some files got deleted from `2021-07-06` and
other days, which I didn't expect. I have about 4 months of data, and I noticed
some files got deleted from different days, months, etc.
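For completeness, a minimal sketch of the orphan cleanup in the same Actions API (the cutoff is a placeholder); `olderThan` exists precisely to avoid deleting files written by jobs that are still in flight:

```scala
import org.apache.iceberg.actions.Actions

// Delete files under the table location that no valid snapshot references.
// olderThan limits deletion to files last modified before the cutoff, so
// files from in-progress writes are left alone.
val deleted = Actions.forTable(table)
  .removeOrphanFiles()
  .olderThan(System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000) // 3 days ago
  .execute()
// `deleted` is the list of removed file paths, useful for auditing
// exactly which files the action considered orphans.
```

Inspecting the returned list before trusting the cleanup (or running against a test copy of the table) would have shown which of those per-day files Iceberg considered unreferenced.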
The list of affected days looks like this:
```
2020-11-17
2020-11-18
2020-11-19
2020-11-20
2020-11-21
2020-11-22
2020-11-23
2020-11-24
2020-11-25
2020-11-26
2020-11-27
2020-11-28
2020-11-29
2020-11-30
2020-12-01
2020-12-02
2020-12-03
2020-12-04
2020-12-05
2020-12-06
2020-12-07
2020-12-08
2020-12-09
2020-12-10
2020-12-11
2020-12-12
2020-12-13
2020-12-14
2020-12-15
2020-12-16
2020-12-17
2020-12-18
2020-12-19
2020-12-20
2020-12-21
2020-12-22
2020-12-23
2020-12-24
2020-12-25
2020-12-26
2020-12-27
2020-12-28
2020-12-29
2020-12-30
2020-12-31
2021-01-15
2021-01-16
2021-01-17
2021-01-18
2021-01-19
2021-01-20
2021-01-21
2021-01-22
2021-01-23
2021-01-24
2021-01-25
2021-01-26
2021-01-27
2021-01-28
2021-01-29
2021-01-30
2021-01-31
2021-02-01
2021-03-23
2021-03-24
2021-03-25
2021-03-31
2021-04-24
2021-04-28
2021-04-29
2021-05-05
2021-05-07
2021-06-02
```
So, if I accidentally expired all snapshots, then I don't understand why
`RemoveOrphanFiles` did not delete all the files.
Maybe those files were never in the table, because I know that the Spark job was
failing periodically.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]