sivabalan narayanan created HUDI-5707:
-----------------------------------------

             Summary: Support offset reset strategy w/ spark streaming read 
from hudi table
                 Key: HUDI-5707
                 URL: https://issues.apache.org/jira/browse/HUDI-5707
             Project: Apache Hudi
          Issue Type: Improvement
          Components: reader-core
            Reporter: sivabalan narayanan


For users reading a hudi table in a streaming manner, we need to support an 
offset reset strategy for when the commit of interest is archived or cleaned up. 

 

Notes from the issue:

In a streaming read, the user might want to get all incremental changes. From 
what I see, this is nothing but an incremental query on a hudi table. With 
incremental queries, we do have a fallback mechanism via 
{{hoodie.datasource.read.incr.fallback.fulltablescan.enable}}.
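
For reference, a minimal sketch of the read options such an incremental query 
would use with the fallback enabled. The helper name and the example instant 
time are made up for illustration; the option keys are the ones documented for 
the Hudi Spark datasource, and may differ between Hudi versions.

```python
def incremental_read_options(begin_instant: str, fallback_to_full_scan: bool) -> dict:
    """Build Hudi datasource options for an incremental query.

    Hypothetical helper, for illustration only.
    """
    return {
        # Read only the changes committed after `begin_instant`.
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
        # If the requested commits are no longer available, fall back to a
        # full table scan instead of failing the query.
        "hoodie.datasource.read.incr.fallback.fulltablescan.enable":
            str(fallback_to_full_scan).lower(),
    }


opts = incremental_read_options("20230206000000", True)

# With a SparkSession `spark` and a table path `base_path` (both assumed):
# df = spark.read.format("hudi").options(**opts).load(base_path)
```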

But in a streaming read, the amount of data read might spike (if we do the 
same), and the user may not have provisioned enough resources for the job.

I am wondering whether we should add something like the {{auto.offset.reset}} 
config we have in kafka. If spark's streaming read already offers something 
similar, we can leverage it; otherwise we can add a new config in hoodie.

So users can configure what should happen in such cases:
 # Resume reading from the earliest valid commit in the hudi table. The 
implementation might be involved, since we need to detect the earliest commit 
that hasn't been cleaned by the cleaner yet.
 # Or do a snapshot query with the latest table state.
 # Or fail the streaming read.
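
The three options above could be sketched roughly as follows. This is not 
existing Hudi code: the strategy names, the function and the exception are all 
hypothetical. It assumes the caller already has the list of commit instants 
that are still readable (not archived or cleaned), and that instant times are 
timestamp strings that compare lexicographically. Whether the start offset is 
inclusive or exclusive is glossed over here.

```python
from enum import Enum
from typing import List, Optional, Tuple


class OffsetResetStrategy(Enum):
    # Hypothetical names, modelled on kafka's auto.offset.reset values.
    EARLIEST = "earliest"  # resume from the earliest commit still available
    LATEST = "latest"      # snapshot query on the latest table state
    FAIL = "fail"          # fail the streaming read


class CommitUnavailableError(Exception):
    """Raised when the requested commit is gone and the strategy is FAIL."""


def resolve_start(
    requested: str,
    available_instants: List[str],
    strategy: OffsetResetStrategy,
) -> Tuple[str, Optional[str]]:
    """Return (query_type, begin_instant) for the next streaming batch."""
    if not available_instants:
        raise ValueError("table has no readable commits")
    earliest = min(available_instants)
    if requested >= earliest:
        # The commit of interest is still available: plain incremental read.
        return ("incremental", requested)
    if strategy is OffsetResetStrategy.EARLIEST:
        return ("incremental", earliest)
    if strategy is OffsetResetStrategy.LATEST:
        return ("snapshot", None)
    raise CommitUnavailableError(
        f"commit {requested} is older than earliest available commit {earliest}"
    )


available = ["20230203000000", "20230204000000", "20230205000000"]
print(resolve_start("20230201000000", available, OffsetResetStrategy.EARLIEST))
print(resolve_start("20230201000000", available, OffsetResetStrategy.LATEST))
```

The hard part noted in option 1, finding the earliest commit the cleaner has 
not touched yet, is hidden behind {{available_instants}} in this sketch.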



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
