felixhzhu opened a new issue, #4237:
URL: https://github.com/apache/amoro/issues/4237
### What happened?
When an Iceberg table's underlying FileIO is an object-store-native one
(S3FileIO / OSSFileIO / GCSFileIO / ResolvingFileIO, commonly seen with Glue /
REST / Polaris catalogs), the AMS-side expireSnapshots task always fails with
`IllegalArgumentException: null` thrown from
`AuthenticatedFileIOAdapter.asFileSystemIO()`. SnapshotsExpiringProcess is
marked FAILED for every such table, even though the underlying data/manifest
files have already been deleted successfully.
Root cause:
RollingFileCleaner.doCleanFiles() unconditionally calls
TableFileUtil.deleteEmptyDirectory(), which invokes io.asFileSystemIO(). For
object-store FileIOs, AuthenticatedFileIOAdapter.supportFileSystemOperations()
returns false, so the single-arg Preconditions.checkArgument(boolean) throws
IllegalArgumentException with a null message — matching the stack trace exactly.
The "delete empty parent directory" step only has meaning under HDFS /
Hadoop FS semantics. On object stores there is no directory entity, the prefix
disappears automatically when the last object is deleted, so this step is a
no-op by definition.
Expected: expireSnapshots should succeed on object-store-backed tables; the
directory-cleanup step should be skipped silently when the FileIO does not
support filesystem operations.
(There is also a downstream issue: once the process is marked FAILED,
DefaultTableProcessStore.tryTransitState rejects the subsequent RETRY_REQUESTED
event because currentStatus is FAILED, producing a WARN storm. I will file that
as a separate issue since the fix lives in amoro-ams, not amoro-format-iceberg.)
### Affects Versions
0.9.0-incubating (also reproducible on the 0.9.x branch HEAD)
### What table formats are you seeing the problem on?
Iceberg
### What engines are you seeing the problem on?
AMS
### How to reproduce
Environment:
- Amoro 0.9.0-incubating, single AMS node.
- Iceberg catalog: Glue / REST / Polaris (anything that produces a table
whose io-impl is an object-store-native FileIO).
- Table property io-impl is one of: S3FileIO / OSSFileIO / GCSFileIO /
ResolvingFileIO.
- Table has at least one expirable snapshot.
Steps:
1. Register the Iceberg catalog into AMS.
2. Pick a table whose oldest snapshot is older than
history.expire.max-snapshot-age-ms.
3. Trigger expireSnapshots — either via the AMS dashboard, or wait for
SnapshotsExpiringExecutor.
4. AMS log immediately reports IllegalArgumentException: null and the
process is marked FAILED.
Observations:
- table.expireSnapshots()...commit() actually succeeds (old snapshots
removed from metadata.json).
- RollingFileCleaner's bulk delete actually succeeds (S3FileIO.deleteFiles
is invoked correctly).
- Only the trailing "delete empty parent directories" step throws and flips
the whole process to FAILED.
Reproduces 100% of the time on every object-store-backed table. HDFS /
Hadoop FS tables (HadoopFileIO, including s3a/cosn/oss) are NOT affected.
### Relevant log output
```shell
ERROR [local-expire-snapshots-7]
[org.apache.amoro.server.process.iceberg.SnapshotsExpiringProcess]
- unexpected expire error of table xxx(tableId=92)
java.lang.IllegalArgumentException: null
at
org.apache.amoro.shade.guava32.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at
org.apache.amoro.io.AuthenticatedFileIOAdapter.asFileSystemIO(AuthenticatedFileIOAdapter.java:98)
at
org.apache.amoro.utils.TableFileUtil.deleteEmptyDirectory(TableFileUtil.java:95)
at
org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.doCleanFiles(RollingFileCleaner.java:114)
at
org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.clear(RollingFileCleaner.java:133)
at
org.apache.amoro.formats.iceberg.maintainer.IcebergTableMaintainer.expireSnapshots(IcebergTableMaintainer.java:...)
```
### Anything else
Frequency: every scheduled run, every object-store-backed table. Verified on
AWS S3 (Glue) and on a Polaris-backed REST catalog.
Proposed fix (already implemented locally, happy to upstream as a PR):
1) TableFileUtil.deleteEmptyDirectory: early-return when
!io.supportFileSystemOperations(). Object stores have no real directories —
this step is a no-op by definition, and it's also where the existing exception
originates.
2) RollingFileCleaner.doCleanFiles:
- guard the "delete empty parent directories" loop with the same
supportFileSystemOperations() check (defense in depth + explicit intent),
- catch per-directory exceptions as WARN so one bad parent dir does not
abort the batch,
- move parentDirectories.clear() and collectedFiles.clear() into a
`finally` block so a failure in directory cleanup never causes already-deleted
paths to be re-submitted on the next 1000-file flush.
No schema / config / Thrift IDL changes. No public API / SPI signature
changes. HadoopFileIO-based tables are completely unaffected — the new guards
return true and code follows the original path. BasicTableTrashManager and
other callers of deleteEmptyDirectory benefit from the same guard for free.
### Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's Code of Conduct
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]