[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

steveloughran Tue, 18 Sep 2018 03:38:10 -0700

Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    I see the reasoning here:
    
    * @jianjianjiao has a very large cluster with many thousands of history 
files of past (successful) jobs.
    * history server startup needs to go through all these logs before being 
usable, so any server restart results in hours of downtime, just from scanning.
    * this patch breaks things up to be incremental.
    
    I don't have any opinions on the patch itself; I've not looked at that code 
for so long my reviews are probably dangerous.
    
    Two thoughts:
    
    1. Would it make sense for the initial scans to go for the most recent logs 
first? The 2.5 hour time to scan all files is still there.
    2. Would you want the UI and REST API to indicate that the scan is still 
in progress, so users needn't worry if the listing is incomplete?
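
To make the two suggestions concrete, here is a minimal sketch in Python. It is not the patch's code or the history server's actual API. `LogEntry`, `ListingState`, `scan_newest_first`, and `batch_size` are all hypothetical names invented for illustration. The sketch assumes each log has a last-modified time, loads logs newest first in batches, and keeps a status flag that a UI or REST endpoint could report while the listing is still incomplete.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical stand-in for one event-log file in the history directory."""
    path: str
    mtime: float  # last-modified time, seconds since the epoch


@dataclass
class ListingState:
    """Hypothetical listing status that a UI or REST endpoint could expose."""
    total: int
    loaded: List[str] = field(default_factory=list)

    @property
    def scan_in_progress(self) -> bool:
        # The listing is incomplete until every known log has been loaded.
        return len(self.loaded) < self.total


def scan_newest_first(entries: List[LogEntry], batch_size: int) -> Iterator[ListingState]:
    """Yield the (shared, mutable) listing state after each batch, newest logs first."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    # Sort by modification time, descending, so recent jobs appear first.
    ordered = sorted(entries, key=lambda e: e.mtime, reverse=True)
    state = ListingState(total=len(ordered))
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        state.loaded.extend(e.path for e in batch)
        yield state
```

With this ordering the total scan time is unchanged, but the most recent applications become visible after the first batch, and `scan_in_progress` gives callers a way to tell a partial listing from a complete one.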


