[GitHub] spark issue #16142: [SPARK-18716][CORE] Restrict the disk usage of spark eve...

uncleGen Thu, 15 Dec 2016 17:49:44 -0800

Github user uncleGen commented on the issue:

    https://github.com/apache/spark/pull/16142
  
    @vanzin 
    
    > For the feature, it feels like it's trying to make the SHS more like a 
"log management system" than a history server. 
    
    Sorry, I don't follow. I am only adding a new clean-up mode, not adding the 
    cleaner itself. One mode is based on age, and the other is based on space 
    usage. That's all.
    
    > But I'll entertain the thought, even though you can probably get pretty 
close to this by using time-based deletion with a shorter max age, coupled with 
log compression (both features that already exist).
    
    I think you may have missed my point. You can set `max age` to a very 
    small value and use compression or whatever, and yes, that will end up 
    using little space. But it will also delete many of the most recent job 
    event logs, so we can no longer review that job history. That is the wrong 
    way to restrict space usage.
    
    > For the implementation, you cannot delete things just based on size. You 
need to account for time too; you have to delete older logs first, otherwise 
you risk deleting the logs for just finished applications instead of a large 
log that's been sitting there for months. 
    
    I think the current `space` mode deletes logs based on space usage, 
    oldest file first. If you see it differently, there may be something wrong 
    in my implementation. I will check it.
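
    To make "based on space usage, oldest file first" concrete, here is a 
minimal sketch of that selection logic. It is written in Python rather than 
the PR's Scala and is not the PR's actual code: `LogInfo`, 
`select_for_deletion` and `max_bytes` are hypothetical names used only for 
illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogInfo:
    """Hypothetical stand-in for the metadata of one event log."""
    path: str
    mod_time: int  # last-modified time, e.g. epoch seconds
    size: int      # size in bytes


def select_for_deletion(logs: List[LogInfo], max_bytes: int) -> List[LogInfo]:
    """Pick the oldest logs to delete until the total size fits in max_bytes."""
    total = sum(log.size for log in logs)
    to_delete = []
    # Walk the logs from oldest to newest so recent logs are removed last.
    for log in sorted(logs, key=lambda entry: entry.mod_time):
        if total <= max_bytes:
            break
        to_delete.append(log)
        total -= log.size
    return to_delete


logs = [
    LogInfo("app-old-large", mod_time=100, size=700),
    LogInfo("app-mid", mod_time=200, size=200),
    LogInfo("app-new", mod_time=300, size=300),
]
print([log.path for log in select_for_deletion(logs, max_bytes=600)])
```

    With these made-up sizes only the old, large log is selected, and the log 
of the most recently finished application is kept.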
    
    > Your change will also bombard the NameNode with requests on every scan, 
    to get the size of each log.
    
    `cleanLogs` is called once every `CLEAN_INTERVAL_S` (1 day by default), 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L218
    If `CLEAN_INTERVAL_S` is set to a very small value, the current 
    implementation already hurts the NameNode. So I do not think getting the 
    size of each log will hurt the NameNode much.
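
    As a rough back-of-the-envelope illustration of this argument, the sketch 
below estimates how many extra size lookups a day the cleaner could add. It 
assumes one lookup per log per cleaner run, which is an upper bound and not a 
measurement; the function name and the figure of 10,000 logs are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def extra_size_requests_per_day(num_logs: int, clean_interval_s: int) -> float:
    """Upper bound on extra size lookups per day.

    Assumes (hypothetically) one lookup per log on every cleaner run.
    """
    if clean_interval_s <= 0:
        raise ValueError("clean_interval_s must be positive")
    runs_per_day = SECONDS_PER_DAY / clean_interval_s
    return runs_per_day * num_logs


# Cleaner interval of one day: each log is looked up once a day.
print(extra_size_requests_per_day(10_000, 86_400))
# Cleaner interval of one minute: the same directory costs 1440 times more.
print(extra_size_requests_per_day(10_000, 60))
```

    The load scales with how often the cleaner runs, so a one-day interval 
keeps the extra requests small, while a very short interval is costly with or 
without the size lookups.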



