jordi-crespo opened a new issue, #13972:
URL: https://github.com/apache/iceberg/issues/13972

   ### Query engine
   
   # Description:
   
   I'm experiencing issues with Iceberg's rewrite_data_files procedure when 
trying to compact heavily fragmented tables while concurrent streaming writes 
are happening. Here's my environment:
   
   - Apache Iceberg 1.9.2
   
   - Nessie 0.10
   
   - Apache Spark 3.5.2
   
   - Streaming writes every few minutes
   
   - Table partitioned by days(time), entity, and bucket(8, _id_asset)
   
   # The Problem:
   
   Some partitions have extreme fragmentation (e.g., 1295 files for just 2.5 MB 
of data). When I try to compact these using rewrite_data_files, I consistently 
get:
   
   ````
   ValidationException: Cannot determine history between starting snapshot 
1456665297449387166 and the last known ancestor 6339655410305960251
   ````
   
   # What I've Tried:
   
   1. Using partial-progress.enabled='false' with retry logic (6 attempts with 
exponential backoff)
   
   2. Using partial-progress.enabled='true'
   
   3. Compacting at different granularities (by individual assets, by date 
ranges)
   
   4. Adjusting various compaction parameters
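
For reference, attempt 1 above can be sketched roughly as follows. This is a minimal illustration, not the exact job code: the catalog name `nessie`, the table `db.events`, the filter, and the helper names `build_rewrite_call` and `retry_with_backoff` are all hypothetical placeholders. Only the `rewrite_data_files` procedure and its `partial-progress.enabled` option come from Iceberg itself. The sketch shows the shape of the retry loop only; retrying by itself did not resolve the ValidationException.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def build_rewrite_call(catalog: str, table: str,
                       where: Optional[str] = None,
                       partial_progress: bool = False) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files procedure.

    The catalog, table and filter values are supplied by the caller;
    the ones used in the usage note below are placeholders.
    """
    args = [f"table => '{table}'"]
    if where is not None:
        # Keep the sketch simple: reject single quotes rather than escape them.
        # String literals inside the filter can use double quotes instead.
        if "'" in where:
            raise ValueError("use double quotes inside the where filter")
        args.append(f"where => '{where}'")
    flag = "true" if partial_progress else "false"
    args.append(f"options => map('partial-progress.enabled', '{flag}')")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


def retry_with_backoff(fn: Callable[[], T], attempts: int = 6,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With an active `SparkSession` named `spark` (assumed, not created here), this would be used as `retry_with_backoff(lambda: spark.sql(build_rewrite_call("nessie", "db.events", where='entity = "a"')))`.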
   
   
   # Additional Context:
   The streaming process writes data every few minutes, and I need to regularly 
compact partitions that become heavily fragmented (1000+ files for just a few 
MB of data). The errors occur specifically on the most fragmented partitions.
   
   Any guidance would be greatly appreciated!
   
   
   
   ### Question
   
   1. What is the recommended approach for handling compaction with concurrent 
streaming writes?
   
   2. Is there a way to make rewrite_data_files more resilient to concurrent 
modifications?
   
   3. Should I be using different isolation levels or snapshot management 
techniques?
   
   4. Is pausing streaming writes the only reliable solution for severely 
fragmented partitions?
   
   5. Are there any planned improvements for automatic compaction in scenarios 
with frequent writes?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

