Sorry for the late comment - I just got to that one. I am ok with the
change, I might have one small comment that is worth considering (but not a
blocker).
The "=" sign has special meaning in a number of shells and requires quoting
and when referred to via URL it requires %-encoding - it's one of the RFC
3986 section 2.2 Reserved Characters

This is not a "huge" issue but it makes it a bit easier for any kind of
manual, debugging, listing etc to use characters that do not require either.

I looked at allowed task_id and dag_ids:

KEY_REGEX = re.compile(r'^[\w.-]+$')
GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')

and list of unreserved RFC3986 characters

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~

And it seems to be that either _ or ~ would be a better choice for
separator (especially the ~ I think makes very close resemblance to =)

drwxrwxr-x  3 jarek jarek  4096 Feb 15 18:24  .
drwxr-xr-x 77 jarek jarek 16384 Feb 13 17:29  ..
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:16 'a=b'
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:24  a_b
-rw-rw-r--  1 jarek jarek     0 Feb 15 18:21  a~b


But this is just a suggestion - I am also ok with '=' but we have to
realise it makes it slightly more annoying for the users.

J.






On Thu, Feb 10, 2022 at 5:54 PM Ash Berlin-Taylor <[email protected]> wrote:

> Hi all,
>
> As part of AIP-42 we need to split out the task instance log files to
> include the "map index". The simple way of doing this would be just to add
> another sub-folder, so for instance, we'd have
>
>   {{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{
> try_number }}.log
>     maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log
>
> Which is getting "deep" and hard to know what each component is. So Daniel
> Standish suggested that maybe we could use the "hive partition style", so
> like this:
>
>   dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{ logical_date
> }}/map_index={{ map_index }}/attempt={{ try_number }}.log
>
> dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log
>
> That also made me realise that the "order" of the components is mixed (as
> task_id is "smaller" than date") so think a better order would be
>
>   dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{ task_id
> }}/map_index={{ map_index }}/attempt={{ try_number }}.log
>
> (I haven't shown it here but the actual template only includes map_index
> for mapped tasks.)
>
> For example the logs folder would have these sort of files in them
> (showing a dag with both a mapped and an normal task)
>
> $ tree ~/airflow/logs /home/ash/airflow/logs ├── dag=maptest │   ├──
> run_id=backfill__2022-02-10T00:00:00+00:00 │   │   ├── task_id=consumer │
> │   │   ├── map_index=0 │   │   │   │   └── attempt=1.log │   │   │   ├──
> map_index=1 │   │   │   │   └── attempt=1.log │   │   │   └── map_index=2
> │   │   │   └── attempt=1.log │   │   └── task_id=make_list │   │   └──
> attempt=1.log └── scheduler ├── 2022-02-10 └── latest ->
> /home/ash/airflow/logs/scheduler/2022-02-10
>
> We've (well TP) has already handled the historic problem where changing
> the log template made old logs unviewable in the UI so that isn't an issue
> anymore, so we can change the template as we like.
>
> I'm asking for lazy consensus to change the log_filename_template to:
> - use run_id instead of logical_date
> - use key=value hive partition style.
> - move run_id (in place of date) to before task_id
>
> I'm hoping to merge this PR soon (this week) but if we don't get consensus
> on the above by Tuesday morning European time we can change the template
> back.
>
> I have this done in a draft PR here
> https://github.com/apache/airflow/pull/21495/files
>
> Thanks,
> Ash
>

Reply via email to