felixhzhu opened a new issue, #4237:
URL: https://github.com/apache/amoro/issues/4237

   ### What happened?
   
   When an Iceberg table's underlying FileIO is an object-store-native one 
(S3FileIO / OSSFileIO / GCSFileIO / ResolvingFileIO, commonly seen with Glue / 
REST / Polaris catalogs), the AMS-side expireSnapshots task always fails with 
`IllegalArgumentException: null` thrown from 
`AuthenticatedFileIOAdapter.asFileSystemIO()`. SnapshotsExpiringProcess is 
marked FAILED for every such table, even though the underlying data/manifest 
files have already been deleted successfully.
   
   Root cause:
   RollingFileCleaner.doCleanFiles() unconditionally calls 
TableFileUtil.deleteEmptyDirectory(), which invokes io.asFileSystemIO(). For 
object-store FileIOs, AuthenticatedFileIOAdapter.supportFileSystemOperations() 
returns false, so the single-arg Preconditions.checkArgument(boolean) throws 
IllegalArgumentException with a null message — matching the stack trace exactly.
   
   The "delete empty parent directory" step only has meaning under HDFS / 
Hadoop FS semantics. On object stores there is no directory entity, the prefix 
disappears automatically when the last object is deleted, so this step is a 
no-op by definition.
   
   Expected: expireSnapshots should succeed on object-store-backed tables; the 
directory-cleanup step should be skipped silently when the FileIO does not 
support filesystem operations.
   
   (There is also a downstream issue: once the process is marked FAILED, 
DefaultTableProcessStore.tryTransitState rejects the subsequent RETRY_REQUESTED 
event because currentStatus is FAILED, producing a WARN storm. I will file that 
as a separate issue since the fix lives in amoro-ams, not amoro-format-iceberg.)
   
   ### Affects Versions
   
   0.9.0-incubating (also reproducible on the 0.9.x branch HEAD)
   
   ### What table formats are you seeing the problem on?
   
   Iceberg
   
   ### What engines are you seeing the problem on?
   
   AMS
   
   ### How to reproduce
   
   Environment:
   - Amoro 0.9.0-incubating, single AMS node.
   - Iceberg catalog: Glue / REST / Polaris (anything that produces a table 
whose io-impl is an object-store-native FileIO).
   - Table property io-impl is one of: S3FileIO / OSSFileIO / GCSFileIO / 
ResolvingFileIO.
   - Table has at least one expirable snapshot.
   
   Steps:
   1. Register the Iceberg catalog into AMS.
   2. Pick a table whose oldest snapshot is older than 
history.expire.max-snapshot-age-ms.
   3. Trigger expireSnapshots — either via the AMS dashboard, or wait for 
SnapshotsExpiringExecutor.
   4. AMS log immediately reports IllegalArgumentException: null and the 
process is marked FAILED.
   
   Observations:
   - table.expireSnapshots()...commit() actually succeeds (old snapshots 
removed from metadata.json).
   - RollingFileCleaner's bulk delete actually succeeds (S3FileIO.deleteFiles 
is invoked correctly).
   - Only the trailing "delete empty parent directories" step throws and flips 
the whole process to FAILED.
   
   Reproduces 100% of the time on every object-store-backed table. HDFS / 
Hadoop FS tables (HadoopFileIO, including s3a/cosn/oss) are NOT affected.
   
   ### Relevant log output
   
   ```shell
   ERROR [local-expire-snapshots-7] 
[org.apache.amoro.server.process.iceberg.SnapshotsExpiringProcess]
   - unexpected expire error of table xxx(tableId=92)
   java.lang.IllegalArgumentException: null
       at 
org.apache.amoro.shade.guava32.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
       at 
org.apache.amoro.io.AuthenticatedFileIOAdapter.asFileSystemIO(AuthenticatedFileIOAdapter.java:98)
       at 
org.apache.amoro.utils.TableFileUtil.deleteEmptyDirectory(TableFileUtil.java:95)
       at 
org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.doCleanFiles(RollingFileCleaner.java:114)
       at 
org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.clear(RollingFileCleaner.java:133)
       at 
org.apache.amoro.formats.iceberg.maintainer.IcebergTableMaintainer.expireSnapshots(IcebergTableMaintainer.java:...)
   ```
   
   ### Anything else
   
   Frequency: every scheduled run, every object-store-backed table. Verified on 
AWS S3 (Glue) and on a Polaris-backed REST catalog.
   
   Proposed fix (already implemented locally, happy to upstream as a PR):
   
   1) TableFileUtil.deleteEmptyDirectory: early-return when 
!io.supportFileSystemOperations(). Object stores have no real directories — 
this step is a no-op by definition, and it's also where the existing exception 
originates.
   
   2) RollingFileCleaner.doCleanFiles:
      - guard the "delete empty parent directories" loop with the same 
supportFileSystemOperations() check (defense in depth + explicit intent),
      - catch per-directory exceptions as WARN so one bad parent dir does not 
abort the batch,
      - move parentDirectories.clear() and collectedFiles.clear() into a 
`finally` block so a failure in directory cleanup never causes already-deleted 
paths to be re-submitted on the next 1000-file flush.
   
   No schema / config / Thrift IDL changes. No public API / SPI signature 
changes. HadoopFileIO-based tables are completely unaffected — the new guards 
return true and code follows the original path. BasicTableTrashManager and 
other callers of deleteEmptyDirectory benefit from the same guard for free.
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's Code of Conduct


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to