[
https://issues.apache.org/jira/browse/HUDI-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kazdy reassigned HUDI-5707:
---------------------------
Assignee: kazdy
> Support offset reset strategy w/ spark streaming read from hudi table
> ---------------------------------------------------------------------
>
> Key: HUDI-5707
> URL: https://issues.apache.org/jira/browse/HUDI-5707
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core
> Reporter: sivabalan narayanan
> Assignee: kazdy
> Priority: Major
>
> For users reading a hudi table in a streaming manner, we need to support an
> offset reset strategy for the case where the commit of interest is archived or
> cleaned up.
>
> Notes from the issue:
> In streaming read, users might want to get all incremental changes. From what
> I see, this is nothing but an incremental query on a hudi table. W/
> incremental query, we do have a fallback mechanism via
> {{{}hoodie.datasource.read.incr.fallback.fulltablescan.enable{}}}.
> But in streaming read, the amount of data read might spike up (if we fall back
> to a full table scan) and the user may not have provisioned enough resources
> for the job.
> I am wondering whether we should add something like the {{auto.offset.reset}}
> config we have in Kafka. If Spark streaming itself already offers something
> similar, we can leverage that; otherwise we can add a new config in hoodie.
> So users can configure what they want to do in such cases:
> # Resume reading from the earliest valid commit in hudi.
> // impl might be involved, since we need to detect the earliest commit that
> hasn't been cleaned by the cleaner yet.
> # Or do a snapshot query w/ the latest table state.
> # Fail the streaming read.
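The three options above could be modeled as a policy that resolves the start
instant when the requested commit is no longer on the active timeline. A
minimal sketch, assuming hypothetical names ({{OffsetResetStrategy}},
{{resolve_start_instant}}) that are illustrative only and not part of any Hudi
API; the strategy values mirror Kafka's {{auto.offset.reset}}:

```python
from enum import Enum


class OffsetResetStrategy(Enum):
    # Hypothetical strategy names, mirroring Kafka's auto.offset.reset values.
    EARLIEST = "earliest"  # resume from the earliest commit still on the active timeline
    LATEST = "latest"      # fall back to a snapshot of the latest table state
    FAIL = "fail"          # abort the streaming read


def resolve_start_instant(requested_instant, active_timeline, strategy):
    """Pick the instant to resume from when `requested_instant` has been
    archived or cleaned up, i.e. is no longer on the active timeline.
    `active_timeline` is a list of instant timestamps (strings of equal
    length, so min/max compare correctly)."""
    if requested_instant in active_timeline:
        # Commit is still available: resume exactly where we left off.
        return requested_instant
    if strategy is OffsetResetStrategy.EARLIEST:
        # Option 1: earliest commit not yet cleaned by the cleaner.
        return min(active_timeline)
    if strategy is OffsetResetStrategy.LATEST:
        # Option 2: snapshot w/ the latest table state.
        return max(active_timeline)
    # Option 3: fail the streaming read.
    raise ValueError(
        f"Instant {requested_instant} is no longer available; "
        "failing per offset reset strategy 'fail'"
    )


# Example: after cleaning/archival, only commits 105..110 remain.
timeline = [str(t) for t in range(105, 111)]
print(resolve_start_instant("100", timeline, OffsetResetStrategy.EARLIEST))  # 105
print(resolve_start_instant("100", timeline, OffsetResetStrategy.LATEST))    # 110
```

Whichever strategy is chosen, resolution would happen once at query (re)start,
before the incremental read begins.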
--
This message was sent by Atlassian Jira
(v8.20.10#820010)