suryaprasanna commented on code in PR #5341:
URL: https://github.com/apache/hudi/pull/5341#discussion_r871852934
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##########
@@ -218,7 +221,45 @@ protected synchronized void scanInternal(Option<KeySpec>
keySpecOpt) {
logFilePaths.stream().map(logFile -> new HoodieLogFile(new
Path(logFile))).collect(Collectors.toList()),
readerSchema, readBlocksLazily, reverseReader, bufferSize,
enableRecordLookups, keyField, internalSchema);
+ /**
+ * Traversal of log blocks from log files can be done in two directions.
+ * 1. Forward traversal
+ * 2. Reverse traversal
+ * For example: BaseFile, LogFile1(LogBlock11,LogBlock12,LogBlock13),
LogFile2(LogBlock21,LogBlock22,LogBlock23)
+ * Forward traversal looks like,
+ * LogBlock11, LogBlock12, LogBlock13, LogBlock21, LogBlock22,
LogBlock23
+ * If we are considering reverse traversal including log blocks,
+ * LogBlock23, LogBlock22, LogBlock21, LogBlock13, LogBlock12,
LogBlock11
+ * Here, reverse traversal also traverses blocks in reverse order of
creation.
+ *
+ * 1. Forward traversal
+ * Forward traversal is easy to do in single writer mode, where the
rollback block is right after the affected data blocks.
+ * With multiwriter mode the blocks can be out of sync. An example
scenario.
+ * B1, B2, B3, B4, R1(B3), B5
+ * In this case, rollback block R1 is invalidating the B3 which is not
the previous block.
+ * This becomes more complicated if we have compacted blocks, which are
data blocks created using log compaction.
+ * TODO: Include support for log compacted blocks.
https://issues.apache.org/jira/browse/HUDI-3580
+ *
+ * To solve this do traversal twice.
Review Comment:
Two traversals are needed to support multiwriter scenarios, where a rollback
block can sit far away from the block it targets. Minor compaction makes this
trickier, since compacted blocks can themselves comprise other compacted
blocks. So this PR tackles the multiwriter scenarios first.
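
To illustrate the two-traversal idea with the B1, B2, B3, B4, R1(B3), B5 example from the comment above, here is a minimal sketch. It is not the code in this PR: `Block`, `rollbackTarget`, and `validBlocks` are hypothetical names made up for the illustration, and compacted blocks are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassScanSketch {

    // Hypothetical stand-in for a log block; this is NOT Hudi's HoodieLogBlock.
    static final class Block {
        final String id;
        final String rollbackTarget; // null for a data block

        Block(String id, String rollbackTarget) {
            this.id = id;
            this.rollbackTarget = rollbackTarget;
        }

        boolean isRollback() {
            return rollbackTarget != null;
        }
    }

    // Returns the ids of the data blocks that survive all rollbacks.
    static List<String> validBlocks(List<Block> blocks) {
        // Traversal 1: collect every rollback target, no matter how far
        // the rollback block is from the block it invalidates.
        Set<String> rolledBack = new HashSet<>();
        for (Block b : blocks) {
            if (b.isRollback()) {
                rolledBack.add(b.rollbackTarget);
            }
        }

        // Traversal 2: forward scan that keeps only data blocks
        // which no rollback block has targeted.
        List<String> valid = new ArrayList<>();
        for (Block b : blocks) {
            if (!b.isRollback() && !rolledBack.contains(b.id)) {
                valid.add(b.id);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block("B1", null),
            new Block("B2", null),
            new Block("B3", null),
            new Block("B4", null),
            new Block("R1", "B3"), // rollback arrives after B4, targets B3
            new Block("B5", null));

        System.out.println(validBlocks(blocks)); // prints [B1, B2, B4, B5]
    }
}
```

A single forward scan that only checks the immediately preceding block would wrongly drop B4 or keep B3 here, which is why the first traversal gathers all rollback targets up front.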