nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418452730

   sure @kazdy, that would be awesome. But I am curious how you plan to fix 
the issue. In a streaming read, the user might want to get all incremental 
changes, so from what I see this is nothing but an incremental query on a hudi 
table. For incremental queries we do have a fallback mechanism via 
`hoodie.datasource.read.incr.fallback.fulltablescan.enable`. 
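
   For reference, the batch-side fallback is typically enabled by adding that option to an incremental read. A minimal sketch (the fallback key is quoted from this comment; the other option keys follow Hudi's standard read options, and the helper name `incremental_read_options` is made up for illustration):

```python
def incremental_read_options(begin_instant: str) -> dict:
    """Assemble Hudi incremental-read options with the
    full-table-scan fallback enabled."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
        # fallback to a full table scan when the requested range was cleaned
        "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
    }

# Usage with Spark would be roughly:
#   spark.read.format("hudi") \
#        .options(**incremental_read_options("20230101000000")) \
#        .load(table_path)
```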
   
   But in a streaming read, the amount of data read might spike (if we do the 
same fallback), and the user may not have provisioned enough resources for the 
job. 
   
   I am thinking we should add something like the `auto.offset.reset` config 
we have in Kafka. If you know of something similar in Spark's streaming read 
itself, we can leverage that; otherwise we can add a new config in hoodie. 
   
   So, users can configure what they want to do in such cases:
   1. Resume reading from the earliest valid commit in hudi. 
      // impl might be involved, since we need to detect the earliest commit 
that has not yet been cleaned by the cleaner. 
   2. Or do a snapshot query with the latest table state. 
   3. Fail the streaming read. 
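
   The three policies above could be resolved roughly like this (a hypothetical sketch, not Hudi code; the names `OffsetResetPolicy` and `resolve_start_instant` are made up for illustration):

```python
from enum import Enum

class OffsetResetPolicy(Enum):
    EARLIEST = "earliest"  # resume from the earliest commit still retained
    LATEST = "latest"      # fall back to a snapshot of the latest table state
    FAIL = "fail"          # abort the streaming read

def resolve_start_instant(requested: str, earliest_retained: str,
                          policy: OffsetResetPolicy) -> str:
    """Pick the instant to resume from when `requested` may have been cleaned.

    Instants are Hudi-style sortable timestamps (e.g. "20230101000000").
    """
    if requested >= earliest_retained:
        return requested  # still retained by the cleaner, no reset needed
    if policy is OffsetResetPolicy.EARLIEST:
        return earliest_retained
    if policy is OffsetResetPolicy.LATEST:
        return "latest-snapshot"  # placeholder: switch to a snapshot query
    raise ValueError(
        f"requested instant {requested} was cleaned; "
        f"earliest retained is {earliest_retained}")
```

   The same trade-off as Kafka applies: `EARLIEST` keeps the stream running but may re-read a large backlog, `LATEST` skips ahead and loses intermediate changes, and `FAIL` surfaces the gap to the operator.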
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
