Hi all,

As part of AIP-42 we need to split out the task instance log files to include the "map index". The simple way of doing this would be just to add another sub-folder, so for instance, we'd have

{{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{ try_number }}.log
   maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log

This is getting "deep", and it is hard to tell what each component is. Daniel Standish suggested that we could use the "hive partition style" instead, like this:

dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{ logical_date }}/map_index={{ map_index }}/attempt={{ try_number }}.log
   dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log

That also made me realise that the "order" of the components is mixed (as task_id is "smaller" than date), so I think a better order would be

dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{ task_id }}/map_index={{ map_index }}/attempt={{ try_number }}.log

(I haven't shown it here but the actual template only includes map_index for mapped tasks.)

For example, the logs folder would contain files like these (showing a dag with both a mapped and a normal task):

$ tree ~/airflow/logs
/home/ash/airflow/logs
├── dag=maptest
│   ├── run_id=backfill__2022-02-10T00:00:00+00:00
│   │   ├── task_id=consumer
│   │   │   ├── map_index=0
│   │   │   │   └── attempt=1.log
│   │   │   ├── map_index=1
│   │   │   │   └── attempt=1.log
│   │   │   └── map_index=2
│   │   │       └── attempt=1.log
│   │   └── task_id=make_list
│   │       └── attempt=1.log
└── scheduler
   ├── 2022-02-10
   └── latest -> /home/ash/airflow/logs/scheduler/2022-02-10
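One practical benefit of the key=value layout is that a path can be split back into its components without knowing the order in advance. A minimal sketch of that idea follows; parse_log_path is a hypothetical helper for illustration, not something in Airflow or in the PR, and it assumes no component value contains a "/".

```python
def parse_log_path(relative_path: str) -> dict:
    """Hypothetical helper: split a key=value style log path into components."""
    # Drop the file extension so the last segment is a plain key=value pair.
    if relative_path.endswith(".log"):
        relative_path = relative_path[: -len(".log")]

    components = {}
    for segment in relative_path.split("/"):
        # Split on the first "=" only, so a value may itself contain "=".
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"segment {segment!r} is not in key=value form")
        components[key] = value
    return components


if __name__ == "__main__":
    path = (
        "dag=maptest/run_id=backfill__2022-02-10T00:00:00+00:00/"
        "task_id=consumer/map_index=0/attempt=1.log"
    )
    print(parse_log_path(path))
```

All values come back as strings, and map_index is simply absent from the result for a non-mapped task.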

We (well, TP) have already handled the historic problem where changing the log template made old logs unviewable in the UI, so that is no longer an issue and we can change the template as we like.

I'm asking for lazy consensus to change the log_filename_template to:
- use run_id instead of logical_date
- use key=value hive partition style
- move run_id (in place of date) to before task_id
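To make the combined effect of these three changes concrete, here is a rough sketch in plain Python of how the resulting path would be assembled. build_log_path and its parameters are hypothetical names used only for illustration; the real change is a Jinja template in the PR, not this function. It assumes a non-mapped task is signalled by map_index being None or negative.

```python
from typing import Optional


def build_log_path(
    dag_id: str,
    run_id: str,
    task_id: str,
    try_number: int,
    map_index: Optional[int] = None,
) -> str:
    """Hypothetical helper: build a key=value style relative log path."""
    # run_id replaces logical_date and comes before task_id.
    parts = [f"dag={dag_id}", f"run_id={run_id}", f"task_id={task_id}"]
    # Only mapped tasks get a map_index component (assumption: None or a
    # negative value means "not mapped").
    if map_index is not None and map_index >= 0:
        parts.append(f"map_index={map_index}")
    parts.append(f"attempt={try_number}.log")
    return "/".join(parts)


if __name__ == "__main__":
    run_id = "backfill__2022-02-10T00:00:00+00:00"
    print(build_log_path("maptest", run_id, "consumer", 1, map_index=0))
    print(build_log_path("maptest", run_id, "make_list", 1))
```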

I'm hoping to merge this PR soon (this week) but if we don't get consensus on the above by Tuesday morning European time we can change the template back.

I have this done in a draft PR here <https://github.com/apache/airflow/pull/21495/files>

Thanks,
Ash
