flashJd opened a new pull request, #6324:
URL: https://github.com/apache/hudi/pull/6324
When we run an incremental query on a Hudi table, the query falls back to a
full table scan if any of the following holds:
1. files referenced in the metadata have been deleted;
2. the query reads from the earliest commit;
3. the start commit is archived;
4. the end commit is archived.
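The four conditions above can be sketched as a single predicate. This is a
minimal illustration, not the code in this PR: the class FullScanCheck, its
method names and the activeInstants list are hypothetical, and it assumes
instant times are fixed-width timestamp strings, so that string order matches
chronological order.

```java
import java.util.List;

final class FullScanCheck {
    // Hypothetical helper: an instant counts as archived when it sorts before
    // the earliest instant still on the active timeline.
    // Assumes activeInstants is non-empty and sorted ascending.
    static boolean isArchived(String instant, List<String> activeInstants) {
        return instant.compareTo(activeInstants.get(0)) < 0;
    }

    // Any one of the four conditions is enough to force a full table scan.
    // startCommit and endCommit are assumed to be non-null instant times.
    static boolean needsFullTableScan(boolean metadataFilesDeleted,
                                      boolean readFromEarliest,
                                      String startCommit,
                                      String endCommit,
                                      List<String> activeInstants) {
        return metadataFilesDeleted
            || readFromEarliest
            || isArchived(startCommit, activeInstants)
            || isArchived(endCommit, activeInstants);
    }
}
```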
In this case, the _endInstant_ parameter passed to getInputSplits() is the
latest instant, so the query scans the latest file slice (which may keep
growing over time), opens it, and filters the records by instantRange.
Consider one query scenario: read.start-commit is archived and
read.end-commit is in the active timeline. This is a full table scan, but we
can set the _endInstant_ parameter to read.end-commit instead of the latest
instant, so that less data is read. Moreover, if there is an upsert between
read.end-commit and the latest instant, using the latest instant as
_endInstant_ loses the data inserted between read.start-commit and
read.end-commit (the data was upserted, so the original record is no longer
present in the latest file slice).
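The lost-record case can be reproduced with a toy model. This is a sketch
under stated assumptions, not Hudi code: EndInstantSketch, Write, sliceAsOf
and incrementalKeys are hypothetical names, a file slice is modelled as the
latest write per key up to a given instant, and instantRange is assumed to be
the half-open range (start, end].

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EndInstantSketch {
    // One write to the table: commit time and record key (hypothetical model).
    static final class Write {
        final String commitTime;
        final String key;
        Write(String commitTime, String key) {
            this.commitTime = commitTime;
            this.key = key;
        }
    }

    // Toy file slice: for each key, the latest write with commitTime <= endInstant.
    // Assumes the log is ordered by commit time.
    static Map<String, Write> sliceAsOf(List<Write> log, String endInstant) {
        Map<String, Write> slice = new LinkedHashMap<>();
        for (Write w : log) {
            if (w.commitTime.compareTo(endInstant) <= 0) {
                slice.put(w.key, w); // a later upsert replaces the earlier record
            }
        }
        return slice;
    }

    // Full scan of the slice at endInstant, then keep the records whose
    // commit time falls in (start, end].
    static List<String> incrementalKeys(List<Write> log, String endInstant,
                                        String start, String end) {
        List<String> keys = new ArrayList<>();
        for (Write w : sliceAsOf(log, endInstant).values()) {
            if (w.commitTime.compareTo(start) > 0 && w.commitTime.compareTo(end) <= 0) {
                keys.add(w.key);
            }
        }
        return keys;
    }
}
```

With key k1 inserted at commit 002 and upserted at commit 005, a query for the
range (001, 003] finds nothing when the slice is taken at the latest instant
005, because the surviving record carries commit time 005. Taking the slice at
003 returns k1.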
Consider another query scenario: read.start-commit and read.end-commit are
both archived. This is also a full table scan. If read.end-commit is from long
ago and has been cleaned, but there is a savepoint after it, we can use this
savepoint for the incremental query and ignore the data inserted or upserted
after the savepoint.
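One way to pick such a savepoint is to take the earliest one at or after
read.end-commit. The sketch below is illustrative only: SavepointChooser and
firstSavepointAtOrAfter are hypothetical names, the savepoints are passed in
as plain instant-time strings, and it again assumes string order matches
chronological order.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.TreeSet;

final class SavepointChooser {
    // Returns the earliest savepoint whose instant time is >= endCommit,
    // or an empty Optional when no such savepoint exists.
    static Optional<String> firstSavepointAtOrAfter(String endCommit,
                                                    Collection<String> savepoints) {
        // TreeSet.ceiling returns the least element >= the argument, or null.
        return Optional.ofNullable(new TreeSet<>(savepoints).ceiling(endCommit));
    }
}
```

When the result is empty, the caller would have to fall back to another
instant, such as the latest one.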
The core idea is to search the file slice closest to read.end-commit.