[ https://issues.apache.org/jira/browse/HUDI-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489028#comment-17489028 ]

Volodymyr Burenin commented on HUDI-2189:
-----------------------------------------

This comes down to automation. There is no need to run Deltastreamer in 
continuous mode, especially for a copy-on-write table, where minimizing write 
amplification matters. I have a scheduler that waits for enough data to 
accumulate in the incoming Kafka topic and then runs Deltastreamer to ingest 
it; it also checks the number of partitions in the table and removes the 
oldest partitions once the time threshold has been reached.
Running a separate Spark job to remove partitions simply complicates the 
overall automation process.
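
For reference, the separate Spark job in question can be as small as the 
sketch below. It assumes the datasource's delete_partition write operation; 
the table name, table path, and partition values are placeholders:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-drop-old-partitions")
      .getOrCreate()

    // Writing an empty dataframe with operation=delete_partition drops the
    // listed partitions from the table without touching any other data.
    spark.emptyDataFrame.write
      .format("hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.operation", "delete_partition")
      // comma-separated partition paths that crossed the retention threshold
      .option("hoodie.datasource.write.partitions.to.delete", "2021/07/01,2021/07/02")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/path/to/table")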

While I am asking for a way to do it via the CLI or the properties file, that 
is not necessarily the best approach; it simply fits the automation I have 
today. It would probably be better to implement an automatic cleanup process 
on the Deltastreamer side that can interpret a given sharding scheme and 
remove obsolete partitions, configured with something like 
hoodie.table.retention.class=com.foo.bar.TimestampBasedKeyRetention plus an 
option to keep, say, the last 30 days of data. Something like that.
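
To make the proposal concrete, here is a purely hypothetical sketch of what 
such a pluggable retention hook could look like. Neither the trait nor the 
config key exists in Hudi today; the class and package names are just the 
made-up ones from above:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // Hypothetical extension point that Deltastreamer would load from
    // hoodie.table.retention.class after each ingest round.
    trait PartitionRetention {
      // Return the subset of existing partition paths to drop.
      def expiredPartitions(partitionPaths: Seq[String]): Seq[String]
    }

    // Example policy: partitions are named yyyy/MM/dd; keep the last 30 days.
    class TimestampBasedKeyRetention(retainDays: Int = 30) extends PartitionRetention {
      private val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

      override def expiredPartitions(partitionPaths: Seq[String]): Seq[String] = {
        val cutoff = LocalDate.now().minusDays(retainDays.toLong)
        partitionPaths.filter(p => LocalDate.parse(p, fmt).isBefore(cutoff))
      }
    }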

> Delete partition support in HoodieDeltaStreamer 
> ------------------------------------------------
>
>                 Key: HUDI-2189
>                 URL: https://issues.apache.org/jira/browse/HUDI-2189
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: deltastreamer
>            Reporter: Samrat Deb
>            Assignee: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>



