nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418452730
Sure @kazdy, that would be awesome. But I am curious how you plan to fix
the issue. In a streaming read, the user might want to get all incremental
changes, and from what I see, this is essentially an incremental query on a Hudi
table. With an incremental query, we do have a fallback mechanism via
`hoodie.datasource.read.incr.fallback.fulltablescan.enable`.
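For reference, a minimal sketch of how that fallback config fits into an incremental read (the begin instant time here is a made-up placeholder; the option keys are the standard Hudi datasource options):

```python
# Option map for an incremental Hudi read with the full-table-scan fallback
# enabled. The begin instant time below is a hypothetical placeholder.
incr_read_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",
    # Fall back to a full table scan if the requested commits were cleaned:
    "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
}

# Usage (assuming a SparkSession `spark` and a Hudi table path):
# df = spark.read.format("hudi").options(**incr_read_opts).load("/path/to/table")
```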
But in a streaming read, the amount of data read might spike (if we do the
same), and the user may not have provisioned enough resources for the job.
I am wondering if we should add something like Kafka's `auto.offset.reset`.
If you know of something similar in Spark's own streaming read, we can
leverage that; otherwise we can add a new config in Hoodie.
So, users can configure what they want to do in such cases:
1. Resume reading from the earliest valid commit in Hudi. (The implementation
might be involved, since we need to detect the earliest commit that
hasn't been cleaned by the cleaner yet.)
2. Do a snapshot query with the latest table state.
3. Fail the streaming read.
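The three policies above could be sketched roughly like this. Note this is a hypothetical illustration of the proposed behavior, not an actual Hudi API; the function name, policy strings, and instant representation are all assumptions:

```python
def resolve_start_instant(requested, retained_instants, policy):
    """Pick a starting instant when the requested one may have been cleaned.

    requested:         the instant the streaming read wants to resume from
    retained_instants: sorted list of instants still on the timeline
                       (i.e., not yet removed by the cleaner)
    policy:            "earliest" | "latest" | "fail", analogous to
                       Kafka's auto.offset.reset
    """
    if requested in retained_instants:
        return requested  # nothing was cleaned; resume normally
    if policy == "earliest":
        return retained_instants[0]   # earliest valid (uncleaned) commit
    if policy == "latest":
        return retained_instants[-1]  # snapshot with latest table state
    raise ValueError(f"start instant {requested} was cleaned; failing stream")
```

With `policy="earliest"`, a request for a cleaned instant would quietly resume from the oldest retained commit, which is the behavior users provisioned for smaller reads may still want to guard against.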