Marcelo Vanzin commented on SPARK-23607:

I think this is a nice trick to speed things up, even though it only works for 
HDFS. I have some ideas for a more generic speedup in this code path, but I 
haven't had time to sit down and try them out, so this could help in the 
meantime.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> ---------------------------------------------------------------------------------------------------------
>                 Key: SPARK-23607
>                 URL: https://issues.apache.org/jira/browse/SPARK-23607
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0
>            Reporter: Ye Zhou
>            Priority: Major
>             Fix For: 2.4.0
> Currently in the Spark History Server, the checkForLogs thread creates replay 
> tasks for log files whose size has changed. Each replay task filters out most 
> of the log file content and keeps only the application summary: applicationId, 
> user, attempt ACLs, start time, and end time. This summary is written to 
> listing.ldb and serves the application list on the SHS home page. For a 
> long-running application, the log file whose name ends with "inprogress" is 
> replayed multiple times just to extract this summary. This wastes compute and 
> data-reading resources on the SHS and delays applications from showing up on 
> the home page. Internally we have a patch that uses HDFS extended attributes 
> to speed up retrieval of the application summary in the SHS. With this patch, 
> the driver writes the application summary into extended attributes as 
> key/value pairs, and the SHS first tries to read from those attributes. If the 
> SHS fails to read the extended attributes, it falls back to reading the log 
> file content as usual. The feature can be enabled or disabled through 
> configuration.
> This patch has been running fine internally for 4 months, and the last-updated 
> timestamp on the SHS stays within 1 minute, which is the interval we 
> configured. Previously we saw delays of up to 30 minutes at our scale, where a 
> large number of Spark applications run per day.
> We want to see whether this kind of approach is also acceptable to the 
> community. Please comment; if so, I will post a pull request for the changes. 
> Thanks.
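The read-with-fallback pattern described above can be sketched roughly as follows. This is a minimal, hypothetical Python model: the real patch would be Scala against the Hadoop FileSystem xattr API (setXAttr/getXAttr) on the event log path; the attribute name, the dict standing in for the xattr store, and the event-parsing helper here are illustrative assumptions only.

```python
import json

# Hypothetical in-memory stand-in for HDFS extended attributes:
# maps log file path -> {xattr name: bytes value}. The real patch would
# call FileSystem.setXAttr / getXAttr on the event log file instead.
XATTR_NAME = "user.spark.appsummary"  # illustrative attribute name

def driver_write_summary(xattrs, log_path, summary):
    """Driver side: store the application summary as a key/value xattr."""
    xattrs.setdefault(log_path, {})[XATTR_NAME] = json.dumps(summary).encode()

def replay_summary_from_log(log_lines):
    """Fallback: replay the event log (one JSON event per line) and keep
    only the summary fields, discarding everything else."""
    summary = {}
    for line in log_lines:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerApplicationStart":
            summary["applicationId"] = event.get("App ID")
            summary["user"] = event.get("User")
            summary["startTime"] = event.get("Timestamp")
        elif event.get("Event") == "SparkListenerApplicationEnd":
            summary["endTime"] = event.get("Timestamp")
    return summary

def shs_read_summary(xattrs, log_path, log_lines):
    """SHS side: prefer the cheap xattr read; if the attribute is missing
    or unreadable, fall back to replaying the log file content as usual."""
    try:
        raw = xattrs[log_path][XATTR_NAME]
        return json.loads(raw)
    except (KeyError, ValueError):
        return replay_summary_from_log(log_lines)
```

For an "inprogress" log this means the SHS pays the replay cost only until the driver has written the xattr once; subsequent polls hit the fast path regardless of how large the log has grown.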

This message was sent by Atlassian JIRA
