LuciferYang commented on PR #54575:
URL: https://github.com/apache/spark/pull/54575#issuecomment-4002421656
I found one potential correctness issue:
In `FsHistoryProvider`, `AttemptInfoWrapper` currently stores
`attempt.logPath` as basename (`reader.rootPath.getName()`), and later
operations resolve it via `resolveLogPath(attempt.logPath)` by scanning
configured log dirs and returning the first match.
This may be ambiguous when two configured directories contain the same event
log basename (e.g. after migration/copy or multi-cluster aggregation,although
the probability is very low.). In that case, download/rebuild/cleanup may
operate on a different physical file than the one originally indexed.
For example, the following test case:
```
test("same log file name across directories resolves incorrectly") {
val dir2 = Utils.createTempDir(namePrefix = "logDir2")
try {
val conf = createTestConf()
.set(HISTORY_LOG_DIR,
s"${testDir.getAbsolutePath},${dir2.getAbsolutePath}")
val provider = new FsHistoryProvider(conf)
val collidingLogName = "shared-event-log"
val log1 = new File(testDir, collidingLogName)
writeFile(log1, None,
SparkListenerApplicationStart("app1", Some("app1-id"), 1L, "test",
None),
SparkListenerApplicationEnd(5L))
val log2 = new File(dir2, collidingLogName)
writeFile(log2, None,
SparkListenerApplicationStart("app2", Some("app2-id"), 2L, "test",
None),
SparkListenerApplicationEnd(6L))
updateAndCheck(provider) { list =>
list.size should be (2)
list.map(_.id).toSet should be (Set("app1-id", "app2-id"))
}
val attempt = provider.getAttempt("app2-id", None)
attempt.logSourceFullPath should be (dir2.getAbsolutePath)
val resolveLogPath = PrivateMethod[(FileSystem,
Path)](Symbol("resolveLogPath"))
val (_, resolvedPath) = provider invokePrivate
resolveLogPath(attempt.logPath)
resolvedPath.toUri.getPath should be (log2.getAbsolutePath)
provider.stop()
} finally {
Utils.deleteRecursively(dir2)
}
}
```
In a scenario where multiple log directories are configured in SHS and there
are event log files with the same name in different directories, I expect the
log path of app2 to be resolved to the second directory where it actually
resides. However, in practice, `resolveLogPath` matches the file with the same
name in the first directory first, resulting in an incorrect path resolution.
So can we assume that filenames are globally unique across configured
directories?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]