dramaticlly opened a new issue, #6888: URL: https://github.com/apache/iceberg/issues/6888
### Apache Iceberg version

1.1.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

By default, `add_files` checks for duplicates when importing externally written data into Iceberg tables. It reads `data_file.file_path` from the entries table and compares it against the file paths found in `source_table`, per https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L532-L541. However, the manifest entry for a deleted file is still present with status = 2, so the check incorrectly treats the deleted file path as a duplicate and prevents the files from being added.

Repro

```scala
// 1. Create the Iceberg table
val tableId = "iceberg.hongyue_zhang.repro"
spark.sql(s"""CREATE TABLE IF NOT EXISTS $tableId (
  id bigint,
  log_dateint bigint,
  request_dateint bigint
) USING iceberg
PARTITIONED BY (log_dateint, request_dateint)""")

// 2. Insert some data
val insertSQL = s"INSERT INTO TABLE $tableId VALUES (1, 20230220, 20230221);"
spark.sql(insertSQL).show

// 3. Delete from the Iceberg table
val deleteSQL = s"DELETE FROM $tableId"
spark.sql(deleteSQL)

// 4. Use add_files to add the files back; this fails with the exception below
val tableIdWoCatalog = tableId.split("\\.").drop(1).mkString(".")
val parquetFilePath = "s3a://bucket/warehouse/hongyue_zhang.db/repro/data"
val addFilesSQL = s"""
  CALL iceberg.system.add_files(
    table => '$tableIdWoCatalog',
    source_table => '`parquet`.`$parquetFilePath`'
  )
""".stripMargin
spark.sql(addFilesSQL).show

java.lang.IllegalStateException: Cannot complete import because data files to be imported already exist within the target table: s3a://bucket/warehouse/hongyue_zhang.db/repro/data/log_dateint=20230220/request_dateint=20230211/00196-17-1f414c2a-c7aa-4f22-887e-f7126a68e9a0-00001.parquet. This is disabled by default as Iceberg is not designed for mulitple references to the same file within the same table. If you are sure, you may set 'check_duplicate_files' to false to force the import.
```
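
To make the reported behaviour concrete, here is a minimal, self-contained sketch of the duplicate check in plain Scala. It is a simplified in-memory model and not Iceberg's actual code: `Entry`, `duplicatesIgnoringStatus`, and `duplicatesAmongLive` are hypothetical names introduced for illustration, and the only facts taken from Iceberg are the manifest entry status codes (0 = existing, 1 = added, 2 = deleted). The second function shows one possible direction for a fix, assuming that entries with status 2 should simply be ignored.

```scala
// Simplified in-memory model of the add_files duplicate check (illustration only).
object DuplicateCheckSketch {
  // Manifest entry status codes as defined by the Iceberg table spec.
  val Existing = 0
  val Added = 1
  val Deleted = 2

  // Hypothetical stand-in for one row of the entries metadata table.
  final case class Entry(status: Int, filePath: String)

  // Behaviour described in the report: every entry counts, including deleted ones.
  def duplicatesIgnoringStatus(entries: Seq[Entry], incoming: Seq[String]): Seq[String] = {
    val known = entries.map(_.filePath).toSet
    incoming.filter(known.contains)
  }

  // Possible fix (an assumption): only live entries count as duplicates.
  def duplicatesAmongLive(entries: Seq[Entry], incoming: Seq[String]): Seq[String] = {
    val live = entries.filter(_.status != Deleted).map(_.filePath).toSet
    incoming.filter(live.contains)
  }
}
```

With a single entry of status 2 for a given path, the first function reports that path as a duplicate while the second does not, which matches the difference between the observed and the expected behaviour in the repro.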
