jerryshao opened a new pull request #1052: URL: https://github.com/apache/incubator-iceberg/pull/1052
If we don't use a qualified path (such as `file:/temp/test_db`) to create or save into a (Hadoop) table, the `file_path` values queried from the metadata table are not qualified paths either. For example:

```
private Dataset<Row> buildValidDataFileDF() {
  String allDataFilesMetadataTable = metadataTableName(MetadataTableType.ALL_DATA_FILES);
  return spark.read().format("iceberg")
      .load(allDataFilesMetadataTable)
      .select("file_path");
}
```

The result here could be:

```
+-----------------------------------------------------------------------------------+
|file_path                                                                          |
+-----------------------------------------------------------------------------------+
|tmp/iceberg_test2/data/00000-172-2805f207-2c0d-4717-acc2-fed60430afeb-00000.parquet|
|tmp/iceberg_test2/data/00001-173-bd5e807d-e96f-49de-b84e-2c254c0777bb-00000.parquet|
|tmp/iceberg_test2/data/00002-174-fb7f84f0-d2ed-4e53-b5b9-6ef2f7da8a73-00000.parquet|
+-----------------------------------------------------------------------------------+
```

But `file.getPath().toString()` in `RemoveOrphanFilesAction#listDirRecursively` returns a qualified path:

```
for (FileStatus file : fs.listStatus(path, HiddenPathFilter.get())) {
  if (file.isDirectory()) {
    subDirs.add(file.getPath().toString());
  } else if (file.isFile() && predicate.test(file)) {
    matchingFiles.add(file.getPath().toString());
  }
}
```

Because one side of the join holds unqualified paths and the other holds qualified paths, the `equalTo` join condition may fail to match files that are still referenced by the table. Those files are then treated as orphans and deleted by mistake.

This PR proposes changing the join condition to `contains`. Another solution is to convert relative paths to qualified ones everywhere.
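The mismatch can be illustrated with plain strings, without Spark or Hadoop. The following is only a minimal sketch: the class `OrphanPathMatchSketch`, the method `findOrphans`, the working directory `file:/home/user/`, and the file name `orphan.parquet` are all hypothetical and are not part of the Iceberg code base. It compares an exact-match check with a substring check in the way the two join conditions would behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrphanPathMatchSketch {

  // Returns the listed files that match none of the valid (referenced) files.
  // useContains = false mimics an equality join; true mimics a substring join.
  static List<String> findOrphans(List<String> listedFiles, List<String> validFiles,
                                  boolean useContains) {
    List<String> orphans = new ArrayList<>();
    for (String listed : listedFiles) {
      boolean referenced = false;
      for (String valid : validFiles) {
        if (useContains ? listed.contains(valid) : listed.equals(valid)) {
          referenced = true;
          break;
        }
      }
      if (!referenced) {
        orphans.add(listed);
      }
    }
    return orphans;
  }

  public static void main(String[] args) {
    // Unqualified path, as returned by the metadata table in the example above.
    List<String> validFiles = Arrays.asList(
        "tmp/iceberg_test2/data/00000-172-2805f207-2c0d-4717-acc2-fed60430afeb-00000.parquet");

    // Qualified paths, as a file system listing would return them.
    // The "file:/home/user/" prefix and "orphan.parquet" are assumptions for illustration.
    List<String> listedFiles = Arrays.asList(
        "file:/home/user/tmp/iceberg_test2/data/00000-172-2805f207-2c0d-4717-acc2-fed60430afeb-00000.parquet",
        "file:/home/user/tmp/iceberg_test2/data/orphan.parquet");

    // Equality: the referenced file is not matched, so both files look like orphans.
    System.out.println(findOrphans(listedFiles, validFiles, false).size()); // prints 2

    // Substring: the referenced file is matched, so only the real orphan remains.
    System.out.println(findOrphans(listedFiles, validFiles, true).size()); // prints 1
  }
}
```

Note that a substring check is more permissive than equality, so it trades the risk of deleting referenced files for the possibility of keeping some files that are in fact orphans.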