Vinothkarthick opened a new issue #11647:
URL: https://github.com/apache/druid/issues/11647


   ### Problem
   
   In Druid, one of our datasets has 2,400 segments, and we have 30 such 
datasets. On a daily basis, we see that records belonging to these 2,400 
segments get updated. The number of updated records is very low (< 0.1%), but 
the updates span every segment. Because of this, we end up backfilling all 
datasets daily.
   
   - We do batch ingestion daily using the index_parallel task type. The 
dataset we load into Druid from Snowflake (where we keep our enterprise-wide 
data) receives daily updates to records as old as 20 years. The updated 
records total less than 1% of the rows in the table, but they span all 
segments in Druid, so we backfill the entire dataset every day.
   
   - Because of this, the cost of the Druid cluster is shooting up due to the 
large number of MiddleManager nodes required.
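
Since Druid batch ingestion can overwrite individual time intervals rather than the whole datasource, one common workaround for the situation above is to reingest only the intervals that contain changed rows. The sketch below (a hypothetical helper, not Druid API; the `affected_intervals` name and month granularity are assumptions) buckets the event timestamps of changed rows into month-aligned ISO-8601 intervals that could then be passed to an ingestion spec's `granularitySpec.intervals`:

```python
from datetime import datetime

def affected_intervals(changed_event_times):
    """Bucket changed-row event timestamps into month-aligned ISO-8601
    intervals ("start/end"), so only those intervals need reingestion
    instead of a full-dataset backfill."""
    buckets = set()
    for ts in changed_event_times:
        # Truncate to the start of the month containing this event.
        start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # The interval ends at the first instant of the next month.
        if start.month == 12:
            end = start.replace(year=start.year + 1, month=1)
        else:
            end = start.replace(month=start.month + 1)
        buckets.add(f"{start.isoformat()}/{end.isoformat()}")
    return sorted(buckets)
```

With intervals computed this way, each daily run would only rewrite the segments whose intervals actually contain updated rows, rather than all 2,400.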
   
   ### Ask
   - A way to update only the records that changed (perhaps SQL MERGE-like 
functionality) would be beneficial.
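
The MERGE-like behavior requested above can be sketched as a keyed upsert (a minimal illustration of the desired semantics, not an existing Druid feature; the `merge_rows` helper and `id` key are assumptions):

```python
def merge_rows(existing, updates, key="id"):
    """Keyed upsert: rows in `updates` replace rows in `existing` that
    share the same key; rows with new keys are appended. This mirrors
    SQL MERGE's WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT."""
    merged = {row[key]: row for row in existing}
    for row in updates:
        merged[row[key]] = row  # overwrite on match, insert otherwise
    return list(merged.values())
```

Applied at segment level, this would let the < 0.1% of changed rows be folded into existing data without rewriting untouched rows.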
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


