Github user squito commented on the issue:
https://github.com/apache/spark/pull/22444
> history server startup needs to go through all these logs before being
> usable, so any server restart results in hours of downtime, just from scanning.
I don't think this is true. The first scan may take a long time, but I
think the SHS is usable even during that time. As soon as a scan makes it
through some file, that file is added to the listing.
But if I understand correctly, the advantage here is that as more
applications are run during that 2.5 hour scan, you will pick those up more
quickly.
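To illustrate the behaviour described above, here is a minimal toy sketch in Python. It is not the history server's actual code: `ToyListing`, `scan`, and the sample log names are hypothetical, and the newest-first ordering is an assumption made for the example. It only shows the idea that each file becomes visible in the listing as soon as it has been scanned, so the listing is usable (though incomplete) while the scan is still running.

```python
from dataclasses import dataclass, field


@dataclass
class ToyListing:
    """Hypothetical stand-in for the application listing."""
    apps: list = field(default_factory=list)


def scan(log_files, listing):
    """Scan logs one at a time, publishing each to the listing immediately."""
    # Assumption for this sketch: logs with the newest mtime are scanned first.
    for name, _mtime in sorted(log_files, key=lambda f: f[1], reverse=True):
        listing.apps.append(name)  # visible to readers right away
        yield name                 # pause point: the scan is still in progress


# Hypothetical (name, modification time) pairs.
logs = [("app-1", 100), ("app-2", 300), ("app-3", 200)]
listing = ToyListing()
scanner = scan(logs, listing)

next(scanner)                  # scan exactly one file
partial = list(listing.apps)   # the listing is already readable mid-scan

for _ in scanner:              # let the scan run to completion
    pass
complete = list(listing.apps)
```

After the first step, `partial` holds only the most recently modified log; `complete` holds all three once the scan finishes.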
> 1. would it make sense for the initial scans to go for the most recent
> logs first, because that 2.5 hour time to scan all files is still there.
> 2. would you want the UI and rest api to indicate that the scan was still
> in progress, and not to worry if the listing was incomplete?
I think both of these already happen.
@jianjianjiao again, it's been a while since I've looked at this code. Does
that sound correct?