cccs-jc commented on issue #1931:
URL: https://github.com/apache/iceberg/issues/1931#issuecomment-745447983
Iceberg 0.9.0
spark 3.0.0
Steps to reproduce
```
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

def createDataFrame(startid, now, numRows, numFiles):
    df = spark.range(start=startid, end=numRows + startid,
                     numPartitions=numFiles)
    df1 = df.select(
        df.id,
        (now + (df.id * INCREMENT_PER_FILE))
            .cast(TimestampType()).alias('loadedby'),
        (now + (df.id * INCREMENT_PER_FILE) - (5 * INCREMENT_PER_FILE))
            .cast(TimestampType()).alias('eventtime'),
        F.expr('concat(uuid())').alias('data'))
    return df1

now = 1607043600
FILES_PER_HOUR = 500
# LOADS_PER_HOUR was not defined in the original snippet; 20 makes
# NUMBER_OF_LOADS below match the 21 snapshots observed (1 CTAS + 20 overwrites)
LOADS_PER_HOUR = 20
SECONDS_PER_HOUR = 60 * 60
INCREMENT_PER_FILE = SECONDS_PER_HOUR / FILES_PER_HOUR
NUM_ROWS = FILES_PER_HOUR
NUM_FILES = FILES_PER_HOUR
startid = 0

df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
df.createOrReplaceTempView('TMP_TABLE')

# create initial table
spark.sql("""
CREATE OR REPLACE TABLE iceberg.test.danglingmetadata
USING iceberg
PARTITIONED BY (hours(loadedby), hours(eventtime))
TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='1'
)
AS SELECT * FROM TMP_TABLE
""")

# replace partitions with insert overwrite
startid = startid + NUM_ROWS
NUMBER_OF_HOURS = 1
NUMBER_OF_LOADS = NUMBER_OF_HOURS * LOADS_PER_HOUR
for i in range(NUMBER_OF_LOADS):
    print(startid)
    NUM_ROWS = FILES_PER_HOUR // LOADS_PER_HOUR
    NUM_FILES = FILES_PER_HOUR // LOADS_PER_HOUR
    df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
    df.createOrReplaceTempView('TMP_TABLE')
    spark.sql('INSERT OVERWRITE iceberg.test.danglingmetadata '
              'SELECT * FROM TMP_TABLE')
    startid = startid + NUM_ROWS
```
```
spark.table('iceberg.test.danglingmetadata.snapshots').count()
```
21 snapshots
```
spark.table('iceberg.test.danglingmetadata.manifests').count()
```
2 manifests
Listing the metadata directory shows 65 files in total, 21 of which are
snap-*.avro files.
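The file counts above come from listing the table's metadata directory. A minimal sketch of that check, using a temporary directory with stand-in file names instead of a real warehouse path:

```python
import glob
import os
import tempfile

# stand-in metadata directory created only for this demonstration;
# in practice, point glob at <warehouse>/test/danglingmetadata/metadata
metadata_dir = tempfile.mkdtemp()
for name in ('v1.metadata.json', 'snap-123.avro', 'abc-m0.avro'):
    open(os.path.join(metadata_dir, name), 'w').close()

all_files = glob.glob(os.path.join(metadata_dir, '*'))
snap_files = [f for f in all_files
              if os.path.basename(f).startswith('snap-')]
print(len(all_files), len(snap_files))  # → 3 1
```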
Running snapshot expiry (expiring everything up to the last minute):
```
table.expireSnapshots().expireOlderThan(tsToExpire).commit()
```
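`expireOlderThan` takes an epoch timestamp in milliseconds. A sketch of how `tsToExpire` could be computed for "up to the last minute" (the one-minute cutoff is my assumption here):

```python
import time

# epoch millis for "one minute ago"; anything committed before this
# cutoff would be eligible for expiry
now_millis = int(time.time() * 1000)
tsToExpire = now_millis - 60 * 1000
```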
In my example table the expire action removed the snapshot and manifest
files. Listing the metadata directory now shows 6 files in total, 1 of
which is a snap-*.avro file.
I'm not sure how the other table we have got into a "bad" state; maybe
there were some failed operations. When a table is in that state,
expiring snapshots does not remove the unreferenced metadata files. That
behavior is reasonable, but it would be useful to have an Action that
specifically removes such "dangling" metadata files.
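Conceptually, the requested Action would diff the set of files referenced by reachable table metadata against the files actually present in the metadata directory, then delete the difference. A sketch of that idea with illustrative stand-in sets (not real Iceberg API calls):

```python
# files physically present in the metadata directory (stand-in names)
present = {'v1.metadata.json', 'snap-1.avro', 'snap-2.avro', 'orphan-m0.avro'}

# files still referenced by reachable snapshots/metadata (stand-in names)
referenced = {'v1.metadata.json', 'snap-2.avro'}

# dangling files are present but unreferenced; these would be deleted
dangling = present - referenced
print(sorted(dangling))  # → ['orphan-m0.avro', 'snap-1.avro']
```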