[ 
https://issues.apache.org/jira/browse/HUDI-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kazdy reassigned HUDI-5707:
---------------------------

    Assignee: kazdy

> Support offset reset strategy w/ spark streaming read from hudi table
> ---------------------------------------------------------------------
>
>                 Key: HUDI-5707
>                 URL: https://issues.apache.org/jira/browse/HUDI-5707
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: reader-core
>            Reporter: sivabalan narayanan
>            Assignee: kazdy
>            Priority: Major
>
> For users reading a hudi table in a streaming manner, we need to support an 
> offset reset strategy if the commit of interest is archived or cleaned up. 
>  
> notes from the issue 
> In a streaming read, the user might want to get all incremental changes. From 
> what I see, this is nothing but an incremental query on a hudi table. With an 
> incremental query, we already have a fallback mechanism via 
> {{hoodie.datasource.read.incr.fallback.fulltablescan.enable}}.
> But in a streaming read, the amount of data read might spike up (if we fall 
> back to a full table scan) and the user may not have provisioned enough 
> resources for the job.
> I am thinking we should add something like the {{auto.offset.reset}} config 
> we have in Kafka. If Spark streaming already offers something similar, we can 
> leverage that; otherwise we can add a new config in hoodie.
> So, users can configure what they want to do in such cases:
>  # Resume reading from the earliest valid commit in hudi. (The impl might be 
> involved, since we need to detect the earliest commit that hasn't been 
> cleaned by the cleaner yet.)
>  # Do a snapshot query w/ the latest table state.
>  # Fail the streaming read.
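The three options above could be sketched as a small resolution step, modeled on Kafka's {{auto.offset.reset}}. This is purely illustrative Python, not a Hudi API: the names {{OffsetResetStrategy}} and {{resolve_start_commit}} are hypothetical, and real Hudi would resolve commits against the active timeline rather than a plain list.

```python
from enum import Enum

class OffsetResetStrategy(Enum):
    """Hypothetical counterpart to Kafka's auto.offset.reset."""
    EARLIEST = "earliest"  # resume from the earliest commit not yet cleaned
    LATEST = "latest"      # snapshot query with the latest table state
    FAIL = "fail"          # fail the streaming read

def resolve_start_commit(requested_commit, retained_commits, strategy):
    """Pick a start commit when the requested one was archived or cleaned.

    retained_commits: sorted commit timestamps still present on the timeline.
    """
    if requested_commit in retained_commits:
        return requested_commit  # commit still available, no reset needed
    if strategy is OffsetResetStrategy.EARLIEST:
        return retained_commits[0]
    if strategy is OffsetResetStrategy.LATEST:
        return retained_commits[-1]
    raise ValueError(
        f"Start commit {requested_commit} is no longer available; "
        "failing the streaming read"
    )
```

A user-facing config would then just select one of the three strategy values, mirroring how Kafka consumers configure {{auto.offset.reset}} to {{earliest}}, {{latest}}, or {{none}}.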



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
