Github user vanzin commented on the pull request:
https://github.com/apache/spark/pull/5886#issuecomment-99537802
Ok, I buy that scenario. But as I mentioned before, the current solution is
not very good: it increases the memory usage of the history server too much. If
you have `n` log files, during polling your HS will now need on the order of
`3 * n` memory (instead of the current `2 * n`, which is not great, but we
shouldn't make it worse).
My suggestion: before doing a `listStatus`, create an empty file in the log
directory and retrieve its mod time. Use that as `newLastModifiedTime`. That way
you don't need to keep another map with every single log file available. Next
time you poll, any modifications that happened during the `listStatus` call will
be caught, since `lastModificationTime` is guaranteed to be earlier than anything
that happened during the `listStatus`. (And `lastModificationTime` basically
becomes `lastPollTime`.)
That may exacerbate the issue raised in SPARK-7189, though.
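To make the suggestion concrete, here is a rough sketch of the polling order I
have in mind. It is not the actual history server code: it uses Python and the
local filesystem as a stand-in for HDFS, `os.scandir` plays the role of
`listStatus`, and the names `poll_log_dir` and `.poll-marker-` are made up for
illustration.

```python
import os
import tempfile


def poll_log_dir(log_dir, last_poll_time):
    """One polling pass: return (changed_paths, new_last_poll_time).

    Times are modification times in nanoseconds, as reported by the
    filesystem that holds log_dir.
    """
    # Create an empty marker file *before* listing and read its mod time.
    # The timestamp comes from the filesystem itself, so it is comparable
    # with the mod times of the log files.
    fd, marker = tempfile.mkstemp(prefix=".poll-marker-", dir=log_dir)
    os.close(fd)
    try:
        new_last_poll_time = os.stat(marker).st_mtime_ns
    finally:
        os.remove(marker)

    # Now list the directory (stand-in for `listStatus`). Anything modified
    # while this listing runs has a mod time >= new_last_poll_time, so the
    # next pass will pick it up without a per-file map.
    changed = []
    for entry in os.scandir(log_dir):
        if entry.is_file() and entry.stat().st_mtime_ns >= last_poll_time:
            changed.append(entry.path)

    return sorted(changed), new_last_poll_time
```

The only state carried between passes is the single timestamp returned as
`new_last_poll_time`. The `>=` comparison is deliberately conservative: a file
whose mod time equals the marker's may be looked at twice, which seems
preferable to missing it.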
> I realize that some log file doesn't get processed until a bit later
That's a good question. Because if the mod time changes, then the file is
being written to, and its mod time will eventually change again. But I guess
the same race can occur if the file is closed / renamed during an app shutdown,
while the HS is doing the listing?