GitHub user uncleGen commented on the issue:
https://github.com/apache/spark/pull/16142
@vanzin
> For the feature, it feels like it's trying to make the SHS more like a
> "log management system" than a history server.
Sorry, I do not follow. I only added a new clean-up mode; I did not add the
cleaner itself. One mode is based on age, and the other is based on space
usage. That's all.
> But I'll entertain the thought, even though you can probably get pretty
> close to this by using time-based deletion with a shorter max age, coupled with
> log compression (both features that already exist).
I think you may have misunderstood what I mean. You can set the `max age` to a
very small value and use compression or whatever else. Yes, that will end up
using little space. But it will also delete many recent job event logs, and
then we cannot review the job history. That is because age-based deletion is
the wrong tool for restricting space usage.
> For the implementation, you cannot delete things just based on size. You
> need to account for time too; you have to delete older logs first, otherwise
> you risk deleting the logs for just finished applications instead of a large
> log that's been sitting there for months.
I think the current `space` mode deletes logs based on space usage, oldest
file first. If you see it behaving differently, there may be something wrong
in my implementation. I will check it.
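The behaviour described here (space-based, oldest file first) can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
code in the PR: `LogInfo`, `SpaceCleanerSketch`, `selectLogsToDelete`, and
`maxBytes` are hypothetical names that do not come from `FsHistoryProvider`.

```scala
// Hypothetical model of one event log; not a type from FsHistoryProvider.
final case class LogInfo(path: String, lastModified: Long, sizeBytes: Long)

object SpaceCleanerSketch {
  /**
   * Selects logs to delete, oldest first, until the total size of the
   * remaining logs is at most maxBytes. The result is in deletion order.
   */
  def selectLogsToDelete(logs: Seq[LogInfo], maxBytes: Long): Seq[LogInfo] = {
    // Oldest logs (smallest modification time) come first.
    val sorted = logs.sortBy(_.lastModified)
    val total = sorted.map(_.sizeBytes).sum
    // freedBefore(i) is the number of bytes freed by deleting every log
    // that is older than sorted(i).
    val freedBefore = sorted.scanLeft(0L)(_ + _.sizeBytes)
    sorted
      .zip(freedBefore)
      // Keep deleting while the space still in use exceeds the limit.
      .takeWhile { case (_, freed) => total - freed > maxBytes }
      .map { case (log, _) => log }
  }
}
```

Under this sketch, a recently finished application's log is only removed once
every older log has already been selected and the limit is still exceeded.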
> Your change will also bombard the NameNode with requests on every scan, to
> get the size of each log.
`cleanLogs` is called once every `CLEAN_INTERVAL_S` (1 day by default):
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L218
If `CLEAN_INTERVAL_S` were set to a very small value, the current
implementation would already be putting load on the NameNode. So I do not
think getting the size of each log will add much load on the NameNode.