flashJd opened a new pull request, #6324:
URL: https://github.com/apache/hudi/pull/6324

       When we run an incremental query on a Hudi table, the query turns 
into a full table scan if any of the following holds:
   1. some files referenced in the metadata have been deleted;
   2. the query reads from the earliest commit;
   3. the start commit is archived;
   4. the end commit is archived.
       In this case, the _endInstant_ parameter passed to getInputSplits() 
is the latest instant, so the scan picks the latest file slice (which may grow 
larger over time), opens it, and filters the records by instantRange.
       Consider a query where read.start-commit is archived and 
read.end-commit is in the active timeline. This is a full table scan, but we 
can set the _endInstant_ parameter to read.end-commit instead of the latest 
instant, so that less data is read. Moreover, if an upsert happens between 
read.end-commit and the latest instant, using the latest instant as 
_endInstant_ loses the data inserted between read.start-commit and 
read.end-commit (the records were upserted, so their original versions are 
missing from the latest file slice).
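
The record loss described above can be reproduced with a toy model. This is only a sketch, not Hudi code: the class `UpsertLossDemo`, the `SLICES` map, the `scan` helper, and the three-digit commit times are all hypothetical, and the instant range is assumed to be inclusive on both ends.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (hypothetical, not Hudi code): one file group, file slices keyed by commit time. */
public class UpsertLossDemo {

    /** Each slice maps record key -> commit time that last wrote the record. */
    static final TreeMap<String, Map<String, String>> SLICES = new TreeMap<>();

    static {
        Map<String, String> slice002 = new LinkedHashMap<>();
        slice002.put("k1", "002");            // k1 inserted at commit 002
        SLICES.put("002", slice002);

        Map<String, String> slice004 = new LinkedHashMap<>(slice002);
        slice004.put("k1", "004");            // k1 upserted at commit 004
        SLICES.put("004", slice004);
    }

    /**
     * Opens the newest slice written at or before endInstant and keeps the
     * records whose commit time falls in [rangeStart, rangeEnd] (assumed inclusive).
     */
    static List<String> scan(String endInstant, String rangeStart, String rangeEnd) {
        List<String> keys = new ArrayList<>();
        Map.Entry<String, Map<String, String>> slice = SLICES.floorEntry(endInstant);
        if (slice == null) {
            return keys;                      // no slice exists yet at endInstant
        }
        for (Map.Entry<String, String> rec : slice.getValue().entrySet()) {
            String commit = rec.getValue();
            if (commit.compareTo(rangeStart) >= 0 && commit.compareTo(rangeEnd) <= 0) {
                keys.add(rec.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Latest instant (004) as endInstant: k1 now carries commit time 004, so it is filtered out.
        System.out.println(scan("004", "001", "003"));
        // read.end-commit (003) as endInstant: the slice written at 002 is opened and k1 is returned.
        System.out.println(scan("003", "001", "003"));
    }
}
```

With the latest instant as the end, the range filter drops `k1` because only its upserted version is visible; with read.end-commit as the end, the older slice is opened and the insert is returned.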
       Consider another query where read.start-commit and read.end-commit 
are both archived. This is also a full table scan. If read.end-commit is from 
long ago and its data has been cleaned, but a savepoint exists after it, we 
can use this savepoint to run the incremental query, ignoring the data 
inserted or upserted after the savepoint.
       The core idea is to search the file slice closest to 
read.end-commit.
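
The selection rule from the two scenarios can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `EndInstantChooser`, `chooseEndInstant`, and the two `TreeSet` arguments are hypothetical stand-ins for the active timeline and the savepoint list, and instant times are assumed to sort lexicographically.

```java
import java.util.TreeSet;

/** Hypothetical helper: picks the endInstant for a full table scan. */
public class EndInstantChooser {

    /**
     * @param readEndCommit  value of read.end-commit, or null if unset
     * @param activeInstants completed instants on the active timeline (assumed sorted by time)
     * @param savepoints     savepoint instants that are still retained
     * @return the instant to use as endInstant, or null if nothing is available
     */
    static String chooseEndInstant(String readEndCommit,
                                   TreeSet<String> activeInstants,
                                   TreeSet<String> savepoints) {
        String latest = activeInstants.isEmpty() ? null : activeInstants.last();
        if (readEndCommit == null) {
            return latest;                    // no end commit given: keep the old behavior
        }
        // Scenario 1: read.end-commit is still on the active timeline, so use it directly.
        if (activeInstants.contains(readEndCommit)) {
            return readEndCommit;
        }
        // Scenario 2: read.end-commit is archived; use the first savepoint at or after it.
        String savepoint = savepoints.ceiling(readEndCommit);
        if (savepoint != null) {
            return savepoint;
        }
        // Otherwise fall back to the latest instant.
        return latest;
    }

    public static void main(String[] args) {
        TreeSet<String> active = new TreeSet<>();
        active.add("005");
        active.add("006");
        active.add("007");
        TreeSet<String> savepoints = new TreeSet<>();
        savepoints.add("003");

        System.out.println(chooseEndInstant("006", active, savepoints)); // active end commit
        System.out.println(chooseEndInstant("002", active, savepoints)); // archived, savepoint after it
        System.out.println(chooseEndInstant("004", active, savepoints)); // archived, no savepoint after it
    }
}
```

In each case the chosen instant is the nearest one to read.end-commit whose file slices are still available, which keeps the scanned slice as small as possible.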

