kongul opened a new issue, #7890: URL: https://github.com/apache/iceberg/issues/7890
### Apache Iceberg version

1.2.1

### Query engine

Spark

### Please describe the bug 🐞

We have a number of Spark jobs that stream data to Iceberg tables. Recently we ran into problems reading those tables: data files had been deleted or overwritten by other data files of a different size (we confirmed this by checking older object versions in the S3 bucket). After investigating, this is what we found.

Here is how the file name is constructed:

https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L51-L100

As the Javadoc there says:

```
 * Constructor with specific operationId. The [partitionId, taskId, operationId] triplet has to be
 * unique across JVM instances otherwise the same file name could be generated by different
 * instances of the OutputFileFactory.
```

Here we can see that `queryId` is passed as `operationId`.

Now let's see what Spark passes there:

https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L159

https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L134C1-L143

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala

So the `queryId` is stored in the stream metadata file and persists across Spark streaming job restarts, which violates the requirement that `The [partitionId, taskId, operationId] triplet has to be unique`. A new streaming job run can therefore generate a file name that already exists and overwrite the existing file.

https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L91-L100
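
The collision described above can be illustrated with a minimal, self-contained sketch. It does not use Iceberg or Spark: `fileName` is a hypothetical stand-in for the name generation in `OutputFileFactory`, the format string and `.parquet` extension are assumptions, and the `queryId` value is made up. It also assumes that partition IDs, task IDs, and the per-factory file counter start over when the job is restarted.

```java
import java.util.HashSet;
import java.util.Set;

public class FileNameCollisionSketch {

  // Hypothetical stand-in for the file name pattern; the exact format is an assumption.
  static String fileName(int partitionId, long taskId, String operationId, int fileCount) {
    return String.format("%05d-%d-%s-%05d.parquet", partitionId, taskId, operationId, fileCount);
  }

  public static void main(String[] args) {
    // Made-up queryId; it is read back from the checkpoint metadata, so it is the same on every run.
    String queryId = "3f1c2a9e-0000-4000-8000-000000000001";

    // Names of files already present in the table location.
    Set<String> written = new HashSet<>();

    // First run of the streaming job: partition 0, task 7, first file.
    String firstRun = fileName(0, 7L, queryId, 1);
    written.add(firstRun);

    // Restarted job: counters are assumed to start over, and the queryId is unchanged.
    String secondRun = fileName(0, 7L, queryId, 1);
    boolean collides = !written.add(secondRun);

    System.out.println("first run:  " + firstRun);
    System.out.println("second run: " + secondRun);
    System.out.println("collision:  " + collides);
  }
}
```

In this sketch the two runs only produce different names if some component of the triplet changes between runs, which is the uniqueness requirement the Javadoc states.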
