pratyakshsharma edited a comment on issue #1362: HUDI-644 Enable user to get 
checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594409237
 
 
   Let me put forward my viewpoint on this. When I was adopting Hudi, I kept 
my already running pipeline writing to one path and started DeltaStreamer 
writing to a different path. I then validated the two outputs against each 
other every day for some time, to gain enough confidence in the framework 
before switching to Hudi completely. 
   
   On your point about switching from Kafka -> HDFS raw parquet -> Hudi table 
to Kafka -> Hudi table: I was thinking about a similar use case some time back, 
and the simplest thing I could think of was to store checkpoints per source 
for a Hudi dataset. Currently we store a single checkpoint under 
"deltastreamer.checkpoint.key" in the .commit file, and its value is in a 
source-specific format, which creates problems when you try to switch the 
source for the same dataset. So I think if we simply introduce more keys like 
this one, each storing the checkpoint for its corresponding source, this use 
case can be solved with minimal effort. And yes, this needs a development 
cycle, since what I am proposing is not supported as of now. WDYT? 
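
A minimal sketch of the per-source idea, to make the proposal concrete. This is not Hudi code: the key suffix scheme, the helper names, and the checkpoint strings are all hypothetical, and the commit metadata is modelled as a plain dict.

```python
from typing import Dict, Optional

# The single key that exists today, as named in the comment.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def source_checkpoint_key(source: str) -> str:
    """Hypothetical per-source key; this naming scheme is an assumption."""
    return f"{CHECKPOINT_KEY}.{source}"


def record_checkpoint(prev_metadata: Dict[str, str], source: str,
                      checkpoint: str) -> Dict[str, str]:
    """Build metadata for a new commit.

    Per-source checkpoints from the previous commit are carried forward so
    that switching sources does not lose the position of the other sources.
    """
    prefix = CHECKPOINT_KEY + "."
    new_metadata = {k: v for k, v in prev_metadata.items() if k.startswith(prefix)}
    new_metadata[source_checkpoint_key(source)] = checkpoint
    return new_metadata


def resume_checkpoint(metadata: Dict[str, str], source: str) -> Optional[str]:
    """Return the last checkpoint written for this source, if any."""
    return metadata.get(source_checkpoint_key(source))


# Illustrative values only; real checkpoint formats depend on the source.
meta = record_checkpoint({}, "dfs", "1583300000000")
meta = record_checkpoint(meta, "kafka", "topic,0:100")
```

With this layout, a run that reads from Kafka looks up only the "kafka" entry, so the differently formatted "dfs" checkpoint can stay in the same commit without being misread.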
   
   Currently, to handle such scenarios, we have 
"deltastreamer.checkpoint.reset_key", which is configurable for every 
DeltaStreamer run, and you can work around the problem with hacks on these two 
keys ("deltastreamer.checkpoint.key" and "deltastreamer.checkpoint.reset_key"), 
but the clean solution would be what I proposed above. The proposed approach 
also works well when you want to switch sources frequently.
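
For illustration, here is one way the interplay of the two keys could look. The rule below is an assumption made for this sketch, not a statement of what DeltaStreamer does: an override given at launch wins only if it differs from the reset key stored in the last commit, otherwise the stored checkpoint is used. The function name and the sample values are hypothetical.

```python
from typing import Dict, Optional

CHECKPOINT_KEY = "deltastreamer.checkpoint.key"
CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key"


def pick_resume_checkpoint(last_commit_metadata: Dict[str, str],
                           override: Optional[str]) -> Optional[str]:
    """Choose where the next run starts reading from (assumed rule)."""
    stored_reset = last_commit_metadata.get(CHECKPOINT_RESET_KEY)
    if override is not None and override != stored_reset:
        # A new override: start from it instead of the stored checkpoint.
        return override
    # No override, or the same override was already applied by an earlier run.
    return last_commit_metadata.get(CHECKPOINT_KEY)


# Illustrative metadata from a last commit written while reading from DFS.
last_commit = {CHECKPOINT_KEY: "1583300000000"}
start = pick_resume_checkpoint(last_commit, "topic,0:100")
```

Under this assumed rule, switching sources means passing an override in the new source's format once, which is the kind of manual step the per-source proposal would avoid.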
   
   I would also like to hear from @leesf and @vinothchandar on this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
