jordi-crespo opened a new issue, #13972: URL: https://github.com/apache/iceberg/issues/13972
### Query engine

# Description:

I'm experiencing issues with Iceberg's rewrite_data_files procedure when trying to compact heavily fragmented tables while concurrent streaming writes are happening.

Here's my environment:

- Apache Iceberg 1.9.2
- Nessie 0.10
- Apache Spark 3.5.2
- Streaming writes every few minutes
- Table partitioned by days(time), entity, and bucket(8, _id_asset)

# The Problem:

Some partitions have extreme fragmentation (e.g., 1295 files for just 2.5MB of data). When I try to compact these using rewrite_data_files, I consistently get:

````
ValidationException: Cannot determine history between starting snapshot 1456665297449387166 and the last known ancestor 6339655410305960251
````

# What I've Tried:

1. Using partial-progress.enabled='false' with retry logic (6 attempts with exponential backoff)
2. Using partial-progress.enabled='true'
3. Compacting at different granularities (by individual assets, by date ranges)
4. Adjusting various compaction parameters

# Additional Context:

The streaming process writes data every few minutes, and I need to regularly compact partitions that become heavily fragmented (1000+ files for just a few MB of data). The errors occur specifically on the most fragmented partitions.

Any guidance would be greatly appreciated!

### Question

1. What is the recommended approach for handling compaction with concurrent streaming writes?
2. Is there a way to make rewrite_data_files more resilient to concurrent modifications?
3. Should I be using different isolation levels or snapshot management techniques?
4. Is pausing streaming writes the only reliable solution for severely fragmented partitions?
5. Are there any planned improvements for automatic compaction in scenarios with frequent writes?
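For readers trying to reproduce the setup, the retry approach from item 1 under "What I've Tried" could look roughly like the sketch below. This is an illustration only, not the code used in the report: the helper names `retry_with_backoff` and `compact_table`, the catalog name `nessie`, the table name `db.events`, and the 1-second base delay are all assumptions. Only the `rewrite_data_files` procedure, the `partial-progress.enabled` option, and the 6 attempts come from the description above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 6,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception with exponentially growing delays.

    The delay before retry n (1-based) is base_delay_s * 2 ** (n - 1).
    The last failure is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def compact_table(spark) -> None:
    """Call the Iceberg rewrite_data_files procedure through Spark SQL.

    `spark` is an existing SparkSession. The catalog name `nessie` and the
    table name `db.events` are placeholders, not taken from the report.
    """
    spark.sql(
        """
        CALL nessie.system.rewrite_data_files(
          table => 'db.events',
          options => map('partial-progress.enabled', 'false')
        )
        """
    )
```

With a live session this would be invoked as `retry_with_backoff(lambda: compact_table(spark))`. As the report notes, retrying alone did not get past the `ValidationException`, so this only documents the attempted workaround and is not a fix.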