GitHub user mattf commented on the pull request:
https://github.com/apache/spark/pull/2471#issuecomment-58885115
> @mattf I understand what you're trying to say, but think about it in
context. As I said above, the "when to poll the file system" code is the most
trivial part of this change. The only advantage of using cron for that is that
you'd have more scheduling options - e.g., absolute times instead of a period.
>
> To achieve that, you'd be considerably complicating everything else.
You'd be creating a new command line tool in Spark, that needs to deal with
command line arguments, be documented, and handle security settings (e.g.
Kerberos) - so it's more burden for everybody, maintainers of the code and
admins alike.
>
> And all that for a trivial, and I'd say, not really needed gain in
functionality.
@aw-altiscale pointed me to camus, which has a nearly separable sweeper component:
https://github.com/linkedin/camus/tree/master/camus-sweeper
my objection to this change is about the architecture and responsibilities of
the Spark components. i don't object to having the functionality.
i think you should implement the ability to sweep/rotate/clean log files in
HDFS, but not as part of a Spark process.
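
For illustration only, here is a minimal sketch of the kind of standalone sweeper meant here, assuming an external scheduler such as cron invokes it outside any Spark process. It is not Camus or Spark code: the function name `sweep_old_logs`, its parameters, and the use of a local directory as a stand-in for an HDFS path are all hypothetical.

```python
import time
from pathlib import Path


def sweep_old_logs(log_dir, max_age_seconds, now=None):
    """Delete files under log_dir that were last modified more than
    max_age_seconds ago. Returns the sorted list of deleted paths.

    Assumption: log_dir is a local directory standing in for an HDFS
    location; a real tool would go through an HDFS client instead.
    """
    now = time.time() if now is None else now
    # Collect candidates first so nothing is deleted while the tree is walked.
    candidates = [p for p in Path(log_dir).rglob("*") if p.is_file()]
    deleted = []
    for path in candidates:
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(str(path))
    return sorted(deleted)
```

With something like this, the "when to run" decision stays with the scheduler, and the Spark processes carry no cleanup responsibility.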