dmgcodevil commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876127259


   Understood, let's say we have the following snapshots:
   
   snapshot_1 (ts=1) contains files A,B
   snapshot_2 (ts=2) contains files C,D
   
   ts - timestamp
   
   If I expire snapshot_1, will I still be able to query data from files A and B? 
Based on your explanation, I should, because snapshot_2's manifest list 
includes A and B. Thus only snapshot_1's metadata can be removed (.metadata.json, 
snap-*.avro), but not the data files A and B.
   
   What will happen if I expire snapshots with a timestamp less than 3? Will Expire 
Snapshots delete A, B, C, D?
   
   I.e., if I make a mistake and somehow specify a very large timestamp, will it 
expire all my snapshots and potentially kill all data files? I think that 
`RemoveOrphanFiles` will definitely delete files.
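
   To make the question concrete, here is a toy model of expiry as I understand 
it. It uses plain Scala collections, not the Iceberg API: `Snapshot`, 
`ExpiryModel`, and `deletableFiles` are names made up for this sketch. It 
assumes each snapshot references every file that is live in the table at that 
point, and it does not model any safeguard that might keep the current snapshot.

```scala
// Toy model only: plain Scala collections, not the Iceberg API.
// Assumption: a snapshot references every file live in the table at that point.
final case class Snapshot(id: String, ts: Long, liveFiles: Set[String])

object ExpiryModel {
  // Files referenced only by expired snapshots (ts < olderThan).
  // No safeguard for the current snapshot is modelled here.
  def deletableFiles(snapshots: Seq[Snapshot], olderThan: Long): Set[String] = {
    val (expired, retained) = snapshots.partition(_.ts < olderThan)
    val stillReferenced = retained.flatMap(_.liveFiles).toSet
    expired.flatMap(_.liveFiles).toSet -- stillReferenced
  }

  val history: Seq[Snapshot] = Seq(
    Snapshot("snapshot_1", 1L, Set("A", "B")),
    Snapshot("snapshot_2", 2L, Set("A", "B", "C", "D"))
  )

  def main(args: Array[String]): Unit = {
    // Empty set: A and B stay referenced by snapshot_2.
    println(deletableFiles(history, olderThan = 2L))
    // A, B, C, D: in this naive model nothing is retained.
    println(deletableFiles(history, olderThan = 3L))
  }
}
```

   Under this model, expiring only snapshot_1 frees nothing, while a cutoff 
past the newest snapshot frees every file. Whether the real action behaves like 
the second case is exactly what I am asking.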
   
   Let me explain my case and the outcome. 
   
   I had a table like the one below:
   
   
   snapshot_1 A, B (2021-07-05)
   snapshot_2 C, D (2021-07-06)
   
   table: A,B,C,D
   
   My data is partitioned by day:
   
   2021-07-05 contains: A,B
   2021-07-06 contains: C,D
   I wanted to combine the files from 2021-07-05:
   
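   ```scala
   import org.apache.iceberg.actions.Actions
   import org.apache.iceberg.expressions.Expressions

   // Restrict the rewrite to startDate <= field < endDate (bounds scaled by 1000)
   Actions.forTable(table).rewriteDataFiles()
         .filter(Expressions.greaterThanOrEqual(field, startDate * 1000))
         .filter(Expressions.lessThan(field, endDate * 1000))
         .targetSizeInBytes(targetSizeMB * 1024 * 1024)
         .execute()
   ```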
   
   snapshot_1 (ts=1) A, B 
   snapshot_2 (ts=2) C, D  
   snapshot_3 (ts=3) F added, A deleted, B deleted
   
   ts - timestamp
   
   table: C,D,F
   
   2021-07-05 contains: A,B,F
   2021-07-06 contains: C,D
   
   I executed Expire Snapshots with ts < 3.
   
   After this operation, I noticed that some files were deleted from the 
`metadata` folder, but A and B were still in the data folder for 2021-07-05.
   
   Then I executed `RemoveOrphanFiles` and noticed that a lot of files (about 90%) 
were removed from the metadata folder, and some files were deleted from 
`2021-07-06` and other days (which I didn't expect). I have about 4 months of 
data, and files were deleted from many different days and months.
   
   The list of affected days looks like this:
   
   ```
   2020-11-17
   2020-11-18
   2020-11-19
   2020-11-20
   2020-11-21
   2020-11-22
   2020-11-23
   2020-11-24
   2020-11-25
   2020-11-26
   2020-11-27
   2020-11-28
   2020-11-29
   2020-11-30
   2020-12-01
   2020-12-02
   2020-12-03
   2020-12-04
   2020-12-05
   2020-12-06
   2020-12-07
   2020-12-08
   2020-12-09
   2020-12-10
   2020-12-11
   2020-12-12
   2020-12-13
   2020-12-14
   2020-12-15
   2020-12-16
   2020-12-17
   2020-12-18
   2020-12-19
   2020-12-20
   2020-12-21
   2020-12-22
   2020-12-23
   2020-12-24
   2020-12-25
   2020-12-26
   2020-12-27
   2020-12-28
   2020-12-29
   2020-12-30
   2020-12-31
   2021-01-15
   2021-01-16
   2021-01-17
   2021-01-18
   2021-01-19
   2021-01-20
   2021-01-21
   2021-01-22
   2021-01-23
   2021-01-24
   2021-01-25
   2021-01-26
   2021-01-27
   2021-01-28
   2021-01-29
   2021-01-30
   2021-01-31
   2021-02-01
   2021-03-23
   2021-03-24
   2021-03-25
   2021-03-31
   2021-04-24
   2021-04-28
   2021-04-29
   2021-05-05
   2021-05-07
   2021-06-02
   ```
   
   
   So, if I accidentally expired all snapshots, then I don't understand why 
`RemoveOrphanFiles` all the files. 
   Maybe those files were never in the table, because I know that the Spark job was 
failing periodically.
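
   A rough way to picture the "never in the table" theory, again as a toy model 
in plain Scala and not the Iceberg API (`StoredFile`, `OrphanModel`, `orphans`, 
and the sample paths are all hypothetical): a file counts as an orphan when no 
remaining table metadata references it. The cutoff on modification time is an 
assumption of this sketch.

```scala
// Toy model only: plain Scala collections, not the Iceberg API.
final case class StoredFile(path: String, lastModified: Long)

object OrphanModel {
  // A file is treated as an orphan when nothing in the remaining table
  // metadata references it and it was last modified before the cutoff.
  def orphans(onDisk: Seq[StoredFile], referenced: Set[String], olderThan: Long): Seq[String] =
    onDisk
      .filter(f => !referenced.contains(f.path) && f.lastModified < olderThan)
      .map(_.path)

  def main(args: Array[String]): Unit = {
    val onDisk = Seq(
      StoredFile("2021-07-05/A", 1L),               // replaced by the rewrite
      StoredFile("2021-07-05/F", 3L),               // still referenced by the table
      StoredFile("2020-11-17/never-committed", 0L)  // hypothetical leftover of a failed job
    )
    val referenced = Set("2021-07-05/F")
    // Lists A and the never-committed file; F is kept because it is referenced.
    println(orphans(onDisk, referenced, olderThan = 10L))
  }
}
```

   If this picture is right, output from failed Spark jobs that was never 
committed would show up as orphans in old partitions, which could explain 
deletions spread across many days.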
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


