raghavendraD opened a new issue #3054:
URL: https://github.com/apache/iceberg/issues/3054
Hi,
RemoveOrphanFiles works only with the Hadoop FS/IO, i.e. when run locally with a Hadoop catalog. When I try to run it against S3 files from EMR using the Glue catalog, it throws the error below. I have tried Iceberg 0.11 and 0.12 with both Spark 3.0.1 and Spark 3.1.1 (all combinations), and tried the call via both the Actions API and the Spark Actions API; the result does not change.
Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();
or
SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
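For context, the catalog setup I'm running with looks roughly like the sketch below (property names as documented for the Iceberg AWS integration; the catalog name matches my code, but the warehouse bucket and job jar are placeholders, not my actual values):

```shell
# Sketch of a spark-submit invocation registering an Iceberg Glue catalog.
# The warehouse bucket and jar name are placeholders.
spark-submit \
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse \
  my-maintenance-job.jar
```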
and the error is
21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: lakehouse_database.mobiletest1, removeOrphanFilesOlderThan: 1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
    at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
    at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
    at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
    at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
I also tried the SQL (stored procedure) version of remove orphan files and hit the error below.
sparkSession.sql("CALL glue_catalog.lakehouse_database.remove_orphan_files(table => 'db.mobiletest1')").show();
and the error is
Exception in thread "main" org.apache.iceberg.exceptions.RuntimeIOException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:236)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.buildActualFileDF(BaseDeleteOrphanFilesSparkAction.java:184)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:157)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.TestWriter.main(TestWriter.java:133)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:214)
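One thing I noticed while debugging: the orphan-file listing goes through Hadoop's FileSystem API, and stock Hadoop ships no implementation registered for the bare "s3" scheme (only "s3a"). A workaround I have seen suggested, though I have not confirmed it fixes this case, is to map the "s3" scheme onto Hadoop's S3A filesystem via Spark config:

```shell
# Hedged workaround sketch: map the bare "s3" scheme to Hadoop's S3A
# implementation so listDirRecursively can resolve s3:// paths.
# Requires hadoop-aws (and its AWS SDK dependency) on the classpath.
spark-submit \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  my-maintenance-job.jar
```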
Is this something to do with my implementation, is it a bug in Iceberg, or am I missing something here? Please help!
Thanks,
Raghu
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]