SamWheating opened a new issue, #18858:
URL: https://github.com/apache/druid/issues/18858

   If an Iceberg table that uses merge-on-read updates or deletes is ingested 
into Druid, the deleted rows are ingested as well. 
   
   As a simple example, we can create a quick Iceberg table using Spark:
   
   ```scala
   val df = Seq(
       ("store_a", 1, 100),
       ("store_a", 2, 200),
       ("store_b", 3, 300),
       ("store_b", 4, 400),
   ).toDF("store_id", "item_count", "price_total")
   
   df.withColumn("ts", current_timestamp()).
       writeTo("demo.test_database.checkouts").
       using("iceberg").
       partitionedBy(hours($"ts")).
       tableProperty("write.update.mode", "merge-on-read").
       create()
   ```
   
   Then update the table:
   ```sql
   UPDATE demo.test_database.checkouts SET price_total = 0 WHERE store_id = 'store_a'
   ```
   
   Ingesting the table into Druid then yields 6 rows (the `count` values below 
sum to 6), because both the original and the updated versions of the `store_a` 
records are ingested:
   ```sql
   SELECT * FROM "checkouts"

   {"__time":"2025-12-19T00:00:00.000Z","store_id":"store_a","count":4,"sum_item_count":6,"sum_price_total":300}
   {"__time":"2025-12-19T00:00:00.000Z","store_id":"store_b","count":2,"sum_item_count":7,"sum_price_total":700}
   ```
   
   This feels like a potential hazard that isn't explicitly called out in 
[the documentation](https://druid.apache.org/docs/latest/development/extensions-contrib/iceberg/).
   
   Ideally we would handle the delete markers and properly materialize the 
data, but that's a pretty big overhaul. As a shorter-term solution, should we 
maybe just fail the ingestion if there are delete markers present in the target 
partitions?
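
As a rough illustration of the two options above, here is a self-contained Java sketch. It does not use the Iceberg or Druid APIs: `MergeOnReadSketch`, `DataFile`, and all method names are hypothetical. It also assumes the UPDATE was written as position-style delete markers against the original data file plus a new data file holding the rewritten `store_a` rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a merge-on-read table.
// None of these types come from Iceberg or Druid.
final class MergeOnReadSketch {

    // A data file plus the row positions that delete markers flag as removed.
    static final class DataFile {
        final List<String> rows;
        final Set<Integer> deletedPositions;

        DataFile(List<String> rows, Set<Integer> deletedPositions) {
            this.rows = rows;
            this.deletedPositions = deletedPositions;
        }
    }

    // Assumed layout after the UPDATE: the original file still holds all four
    // rows, with the two store_a positions marked deleted, and a second file
    // holds the rewritten store_a rows.
    static List<DataFile> exampleTable() {
        DataFile original = new DataFile(
            Arrays.asList("store_a,1,100", "store_a,2,200", "store_b,3,300", "store_b,4,400"),
            new HashSet<>(Arrays.asList(0, 1)));
        DataFile updated = new DataFile(
            Arrays.asList("store_a,1,0", "store_a,2,0"),
            Collections.<Integer>emptySet());
        return Arrays.asList(original, updated);
    }

    // Reads every row and ignores delete markers (the behaviour reported here).
    static List<String> readIgnoringDeletes(List<DataFile> files) {
        List<String> out = new ArrayList<>();
        for (DataFile file : files) {
            out.addAll(file.rows);
        }
        return out;
    }

    // Longer-term option: skip every row position that is marked deleted.
    static List<String> readApplyingDeletes(List<DataFile> files) {
        List<String> out = new ArrayList<>();
        for (DataFile file : files) {
            for (int pos = 0; pos < file.rows.size(); pos++) {
                if (!file.deletedPositions.contains(pos)) {
                    out.add(file.rows.get(pos));
                }
            }
        }
        return out;
    }

    // Shorter-term option: refuse to read if any file carries delete markers.
    static List<String> readOrFail(List<DataFile> files) {
        for (DataFile file : files) {
            if (!file.deletedPositions.isEmpty()) {
                throw new IllegalStateException(
                    "Delete markers present; refusing to ingest stale rows");
            }
        }
        return readIgnoringDeletes(files);
    }

    public static void main(String[] args) {
        List<DataFile> table = exampleTable();
        System.out.println("ignoring deletes: " + readIgnoringDeletes(table).size());
        System.out.println("applying deletes: " + readApplyingDeletes(table).size());
    }
}
```

With this model, ignoring the markers returns 6 rows (matching the counts above), applying them returns 4, and `readOrFail` throws instead of silently ingesting the stale rows.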
   
   Happy to help with the implementation here, or at least with updating the 
documentation to make this clearer. Let me know what you think is the best 
path forward.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

