In light of fixing logging, I would definitely appreciate a written design, 
especially as there have been multiple attempts to fix some of these issues, 
but those have been more like stop-gap fixes. 

In my opinion Airflow should not stipulate in a hard-coded fashion where and 
how logging takes place. It should behave more like ‘log4j’ configurations: it 
should not just use “dag_id + task_id + execution_date” and write this to an 
arbitrary location on the filesystem. I could imagine a settings file 
“logging.conf” that sets up something like this:

[logger_scheduler]
level = INFO
handler = stderr
qualname = airflow.scheduler
formatter = scheduler_formatter
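
For illustration, a minimal sketch of how such a file could be consumed (this 
is just the Python standard library, not existing Airflow code; the file name 
“logging.conf” is my assumption):

# Sketch only: "logging.conf" is assumed to sit next to airflow.cfg and to
# contain the usual [loggers]/[handlers]/[formatters] sections in addition to
# the [logger_scheduler] section shown above.
import logging
import logging.config

logging.config.fileConfig("logging.conf", disable_existing_loggers=False)

log = logging.getLogger("airflow.scheduler")
log.info("scheduler started")  # routed per logging.conf, not a hard-coded file path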

In airflow.cfg it should be possible to set something like this:

[scheduler]
use_syslog = True
syslog_log_facility = LOG_LOCAL0

This would allow logging to syslog so logs can be moved to a centralised 
location if required (syslog being a special case, as far as I know).
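
As a rough sketch of what that could translate to internally (the option names 
are just my proposal above, not existing Airflow settings):

# Sketch only: maps the proposed use_syslog / syslog_log_facility options onto
# the standard library SysLogHandler.
import logging
from logging.handlers import SysLogHandler

use_syslog = True                    # would be read from airflow.cfg [scheduler]
syslog_log_facility = "LOG_LOCAL0"   # would be read from airflow.cfg [scheduler]

if use_syslog:
    facility = getattr(SysLogHandler, syslog_log_facility)
    handler = SysLogHandler(address="/dev/log", facility=facility)
    handler.setFormatter(logging.Formatter("airflow-scheduler: %(message)s"))
    logging.getLogger("airflow.scheduler").addHandler(handler)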

Elasticsearch and any other backend can then just be a handler, and we can 
remove the custom code proposed in PR 
https://github.com/apache/incubator-airflow/pull/2380 by using 
https://github.com/cmanaha/python-elasticsearch-logger, for example.
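
To be concrete: a backend would simply be a logging.Handler, nothing more. A 
rough sketch (class and field names are purely illustrative, not taken from 
that PR or library):

# Sketch only: any backend reduces to a standard logging.Handler, so no
# Airflow-specific log management code is needed.
import logging

class ElasticsearchHandler(logging.Handler):
    def __init__(self, index):
        super().__init__()
        self.index = index

    def emit(self, record):
        doc = {"index": self.index, "message": self.format(record)}
        # a real implementation would ship `doc` to an Elasticsearch cluster here
        print(doc)

logging.getLogger("airflow.task").addHandler(ElasticsearchHandler("airflow-logs"))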

I can then be convinced to add something like “attempt”, but there are probably 
more friendly ways to solve it at that time. In addition, ‘attempts’ should 
then, imho, not be managed by the task or the CLI, but rather by the executor, 
as that is the process which “attempts” a task. 

Bolke.


> On 22 Jun 2017, at 01:21, Dan Davydov <[email protected]> wrote:
> 
> Responding to some of Bolke's concerns in the github PR for this change:
> 
> > Mmm still not convinced. Especially on elastic search it is just easier to 
> > use the start_date to shard on.
> Sharding on start_date isn't great because there is still some risk of 
> collisions, and it means that we are coupling the primary key with start_date 
> unnecessarily (e.g. hypothetically you could allow two tasks to run at the 
> same time in Airflow, and in that case start_date would no longer be a valid 
> primary key); using monotonically increasing IDs for DB entries like this is 
> pretty standard practice.
> 
> > In addition I'm very against the managing of log files this way. Log files 
> > are already a mess and should be refactored to be consistent and to be 
> > managed from one place.
> I agree about the logging mess, and there seem to have been efforts 
> attempting to fix this, but they have all been abandoned, so we decided to 
> move ahead with this change. I need to take a look at the PR first, but this 
> change should actually make logging less messy, since it should add an 
> abstraction for logging modules, and because you know exactly which try 
> numbers (and how many) ran on which workers from the file path. The log 
> folder structure already kind of mimicked the primary key of the 
> task_instance table (dag_id + task_id + execution_date), but really 
> try_number logically belongs in this key as well (at least for the key for 
> log files). 
> 
> > The docker packagers can already not package airflow correctly without 
> > jumping through hoops. Arbitrarily naming it certainly does not help here.
> If this is referring to the /<ATTEMPT #>/ in the path, I don't think this is 
> arbitrarily naming it. A log "unit" really should be a single task run (not 
> an arbitrary grouping of a variable number of runs), and each unit 
> should have a unique key or location. One of the reasons we are working on 
> this effort is to actually make Airflow play nicer with Kubernetes/Docker 
> (since airflow workers should ideally be ephemeral), and allowing a separate 
> service to read and ship the logs is necessary in this case since the logs 
> will be destroyed along with the worker instance. I think in the future we 
> should also allow custom logging modules (e.g. directly writing logs to some 
> service).
> 
> 
> On Wed, Jun 21, 2017 at 3:11 PM, Allison Wang <[email protected]> wrote:
> Hi,
> 
> I am in the process of making airflow logging backed by Elasticsearch (for 
> more detail, please check AIRFLOW-1325 
> <https://issues.apache.org/jira/browse/AIRFLOW-1325>). Here are several more 
> logging improvements we are considering:
> 
> 1. Log streaming. Auto-refresh the logs if tasks are running.
> 
> 2. Separate logs by attempts.
> 
> Instead of logging everything into one file, logs can be separated by attempt 
> number and displayed using tabs. The attempt number here is a monotonically 
> increasing number that represents each task instance run (unlike try_number, 
> the attempt number is not reset when a task instance is cleared).
> try_number: the n-th retry by the task instance. try_number should not be 
> greater than retries. Clearing a task will reset try_number to 0.
> attempt: the number of times the current task instance has been executed. 
> 
> 3. Collapsible logs. Collapse logs that are mainly for debugging Airflow 
> internals and aren't really related to users' tasks (for example, the logs 
> shown before "starting attempt 1 of 1").
> 
> All suggestions are welcome.
> 
> Thanks,
> Allison 
> 
