sivabalan narayanan created HUDI-5707:
-----------------------------------------
Summary: Support offset reset strategy w/ spark streaming read
from hudi table
Key: HUDI-5707
URL: https://issues.apache.org/jira/browse/HUDI-5707
Project: Apache Hudi
Issue Type: Improvement
Components: reader-core
Reporter: sivabalan narayanan
For users reading a Hudi table in a streaming manner, we need to support an offset
reset strategy if the commit of interest is archived or cleaned up.
notes from the issue
In a streaming read, the user might want to get all incremental changes. From what I
see, this is nothing but an incremental query on a Hudi table. With an incremental
query, we do have a fallback mechanism via
{{hoodie.datasource.read.incr.fallback.fulltablescan.enable}}.
But in a streaming read, the amount of data read might spike (if we plan to do
the same), and the user may not have provisioned enough resources for the job.
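For reference, the incremental query with the fallback flag mentioned above can be sketched as a set of datasource options. This is a minimal sketch, not a definitive recipe: the helper name {{incremental_read_options}} and the sample instant time are hypothetical, and the option keys other than the fallback flag are the commonly documented Hudi datasource keys, which may differ between Hudi versions.

```python
from typing import Dict


def incremental_read_options(begin_instant: str, fallback_to_full_scan: bool) -> Dict[str, str]:
    """Build Hudi datasource options for an incremental query (hypothetical helper)."""
    return {
        # Ask for an incremental query instead of the default snapshot query.
        "hoodie.datasource.query.type": "incremental",
        # Instant time to start reading from (assumed key; check your Hudi version).
        "hoodie.datasource.read.begin.instanttime": begin_instant,
        # Fall back to a full table scan when the needed commits or files are gone.
        "hoodie.datasource.read.incr.fallback.fulltablescan.enable": str(fallback_to_full_scan).lower(),
    }


# Sample instant time is made up for illustration.
opts = incremental_read_options("20230205120000", fallback_to_full_scan=True)
```

Assuming a SparkSession named {{spark}} and a table path {{base_path}}, the options could be passed as {{spark.read.format("hudi").options(**opts).load(base_path)}}. With the fallback enabled, a missing commit turns into a full scan, which is exactly the resource spike the streaming case is concerned about.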
I am wondering whether we should add something like {{auto.offset.reset}}, which we
have in Kafka. If you know of something similar in streaming reads from Spark
itself, we can leverage that; otherwise we can add a new config in Hudi.
So users can configure what they want to do in such cases:
# Resume reading from the earliest valid commit in Hudi.
// The implementation might be involved, since we need to detect the earliest commit that hasn't been
cleaned by the cleaner yet.
# Or do a snapshot query with the latest table state.
# Fail the streaming read.
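The three options above could be sketched as a small decision function. This is only an illustration of the proposal, not an existing Hudi API: {{OffsetResetStrategy}}, {{ReadPlan}} and {{plan_streaming_read}} are hypothetical names, the list of retained instants is assumed to be supplied by the caller, and the start instant is treated as inclusive for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class OffsetResetStrategy(Enum):
    """Hypothetical strategies mirroring the three options listed in the issue."""
    EARLIEST = "earliest"   # resume from the earliest commit not yet cleaned
    LATEST = "latest"       # do a snapshot query with the latest table state
    FAIL = "fail"           # fail the streaming read


@dataclass(frozen=True)
class ReadPlan:
    mode: str                     # "incremental" or "snapshot"
    start_instant: Optional[str]  # first instant to read; None for snapshot


def plan_streaming_read(requested: str, retained: List[str],
                        strategy: OffsetResetStrategy) -> ReadPlan:
    """Decide how to read when the requested instant may have been cleaned.

    `retained` holds instant times still readable (assumption: fixed-width
    timestamp strings, so string comparison matches time order).
    """
    if not retained:
        raise ValueError("no retained instants to read from")
    earliest = min(retained)
    if requested >= earliest:
        # The commit of interest is still available; read incrementally.
        return ReadPlan("incremental", requested)
    if strategy is OffsetResetStrategy.EARLIEST:
        return ReadPlan("incremental", earliest)
    if strategy is OffsetResetStrategy.LATEST:
        return ReadPlan("snapshot", None)
    raise RuntimeError(
        f"instant {requested} is older than earliest retained instant {earliest}"
    )


# Made-up instant times for illustration.
retained = ["20230203000000", "20230204000000", "20230205000000"]
plan = plan_streaming_read("20230201000000", retained, OffsetResetStrategy.EARLIEST)
```

The hard part noted in the first option, detecting which commit is the earliest one not yet cleaned, is deliberately left outside this sketch and represented by the {{retained}} argument.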
--
This message was sent by Atlassian Jira
(v8.20.10#820010)