GitHub user jianjianjiao commented on a diff in the pull request:
https://github.com/apache/spark/pull/22444#discussion_r218292773
--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -465,20 +475,31 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
           }
         } catch {
           case _: NoSuchElementException =>
-            // If the file is currently not being tracked by the SHS, add an entry for it and try
-            // to parse it. This will allow the cleaner code to detect the file as stale later on
-            // if it was not possible to parse it.
-            listing.write(LogInfo(entry.getPath().toString(), newLastScanTime, None, None,
-              entry.getLen()))
--- End diff --
Hi @squito, thanks for looking into this PR.
When the Spark history server starts, it scans the event-log folder and processes the logs on multiple threads, and it will not start the next scan until the first one finishes. That is the problem: in our cluster there are about 20K event-log files (often larger than 1 GB), including roughly 1K .inprogress files, and the first scan takes about two and a half hours. During those 2.5 hours, if a user submits a Spark application and it finishes, the user cannot find it in the Spark history UI and has to wait for the next scan.
That is why I added a limit on how many files to scan each time, e.g. 3K. No matter how many log files are in the event-log folder, the provider first scans and handles the first 3K of them, and then starts the second scan. Suppose that during the first scan 5 new applications appear and another 10 applications are updated; the second scan then handles those 15 applications plus another 2885 files (files 3001 to 5885) from the event folder.
checkForLogs scans the event-log folder and only handles files that have been updated or not yet handled.
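To make the batching concrete, here is a minimal, self-contained sketch of the per-scan cap. The names (LogFile, selectBatch, maxFilesPerScan) and the newest-first ordering are illustrative assumptions, not the actual identifiers or policy in this PR:

    // Illustrative only: cap how many event-log files one scan pass handles.
    case class LogFile(path: String, mtime: Long, needsProcessing: Boolean)

    def selectBatch(all: Seq[LogFile], maxFilesPerScan: Int = 3000): Seq[LogFile] =
      all.filter(_.needsProcessing)  // only files that are new or were updated
        .sortBy(-_.mtime)            // one plausible ordering: newest first
        .take(maxFilesPerScan)       // defer the rest to the next scan pass

With 5885 candidates and a cap of 3000, the first pass handles 3000 files and the next pass picks up the remainder plus anything that changed in between.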