Oh yes -- I'd been tested on ZSH which handles this without me having to care about it, but bash doesn't.

I thought `_` is valid in dag/task ids right now?

S3 is fine with `=` in the object path, and I don't think URI concern matters much as I can't think of any cases where a user is going to generate URLs that contain the `=`.

I can see advantages and dis-advantages to either way. What are other people's thoughts?

Good point, and changing the template is now easy so we can change this "at any point" (but ideally before 2.3 is in beta/rc) which is why I felt happy merging my PR.

-ash

On Tue, Feb 15 2022 at 18:28:02 +0100, Jarek Potiuk <[email protected]> wrote:
Sorry for the late comment - I just got to that one. I am ok with the change, I might have one small comment that is worth considering (but not a blocker). The "=" sign has special meaning in a number of shells and requires quoting and when referred to via URL it requires %-encoding - it's one of the RFC 3986 section 2.2 Reserved Characters

This is not a "huge" issue but it makes it a bit easier for any kind of manual, debugging, listing etc to use characters that do not require either.

I looked at allowed task_id and dag_ids:

KEY_REGEX = re.compile(r'^[\w.-]+$')
GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')

and list of unreserved RFC3986 characters

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~

And it seems to be that either _ or ~ would be a better choice for separator (especially the ~ I think makes very close resemblance to =)

drwxrwxr-x  3 jarek jarek  4096 Feb 15 18:24  .
drwxr-xr-x 77 jarek jarek 16384 Feb 13 17:29  ..
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:16 'a=b'
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:24  a_b
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:21  a~b


But this is just a suggestion - I am also ok with '=' but we have to realise it makes it slightly more annoying for the users.

J.






On Thu, Feb 10, 2022 at 5:54 PM Ash Berlin-Taylor <[email protected] <mailto:[email protected]>> wrote:
Hi all,

As part of AIP-42 we need to split out the task instance log files to include the "map index". The simple way of doing this would be just to add another sub-folder, so for instance, we'd have

{{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{ try_number }}.log
    maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log

Which is getting "deep" and hard to know what each component is. So Daniel Standish suggested that maybe we could use the "hive partition style", so like this:

dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{ logical_date }}/map_index={{ map_index }}/attempt={{ try_number }}.log dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log

That also made me realise that the "order" of the components is mixed (as task_id is "smaller" than date") so think a better order would be

dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{ task_id }}/map_index={{ map_index }}/attempt={{ try_number }}.log

(I haven't shown it here but the actual template only includes map_index for mapped tasks.)

For example the logs folder would have these sort of files in them (showing a dag with both a mapped and an normal task)

$ tree ~/airflow/logs
/home/ash/airflow/logs
├── dag=maptest
│   ├── run_id=backfill__2022-02-10T00:00:00+00:00
│   │   ├── task_id=consumer
│   │   │   ├── map_index=0
│   │   │   │   └── attempt=1.log
│   │   │   ├── map_index=1
│   │   │   │   └── attempt=1.log
│   │   │   └── map_index=2
│   │   │       └── attempt=1.log
│   │   └── task_id=make_list
│   │       └── attempt=1.log
└── scheduler
    ├── 2022-02-10
    └── latest -> /home/ash/airflow/logs/scheduler/2022-02-10

We've (well TP) has already handled the historic problem where changing the log template made old logs unviewable in the UI so that isn't an issue anymore, so we can change the template as we like.

I'm asking for lazy consensus to change the log_filename_template to:
- use run_id instead of logical_date
- use key=value hive partition style.
- move run_id (in place of date) to before task_id

I'm hoping to merge this PR soon (this week) but if we don't get consensus on the above by Tuesday morning European time we can change the template back.

I have this done in a draft PR here <https://github.com/apache/airflow/pull/21495/files>

Thanks,
Ash

Reply via email to