Re: Changing default log file template (lazy consensus)

Jarek Potiuk Tue, 15 Feb 2022 10:39:51 -0800

Yeah. No problem with merging. We can indeed change it and maybe even it is
good it is merged now with "=" because we might see ourselves how "really"
annoying it is. There is nothing like a good experiment. Maybe in a few
weeks someone will say "I can't stand this '=' any more - let's change it
:)".


One more thought.

Technically the separator does not have to be "forbidden" in the id. If the
prefixes are fixed \task_id=', 'dag_id=', the 'separator' might be
anything we want including . - as well because it is not a separator - it's
part of the prefix. It might make it slightly more complex to split it, but
it isn't that difficult. I think that is important because we discussed
that we want to allow any characters for ids - and technically it does not
block us. The only "really" difficult character in this context for ids
would be '/'.

J.


On Tue, Feb 15, 2022 at 7:03 PM Ash Berlin-Taylor <[email protected]> wrote:

> Oh yes -- I'd been tested on ZSH which handles this without me having to
> care about it, but bash doesn't.
>
> I thought `_` is valid in dag/task ids right now?
>
> S3 is fine with `=` in the object path, and I don't think URI concern
> matters much as I can't think of any cases where a user is going to
> generate URLs that contain the `=`.
>
> I can see advantages and dis-advantages to either way. What are other
> people's thoughts?
>
> Good point, and changing the template is now easy so we can change this
> "at any point" (but ideally before 2.3 is in beta/rc) which is why I felt
> happy merging my PR.
>
> -ash
>
> On Tue, Feb 15 2022 at 18:28:02 +0100, Jarek Potiuk <[email protected]>
> wrote:
>
> Sorry for the late comment - I just got to that one. I am ok with the
> change, I might have one small comment that is worth considering (but not a
> blocker).
> The "=" sign has special meaning in a number of shells and requires
> quoting and when referred to via URL it requires %-encoding - it's one of
> the RFC 3986 section 2.2 Reserved Characters
>
> This is not a "huge" issue but it makes it a bit easier for any kind of
> manual, debugging, listing etc to use characters that do not require either.
>
> I looked at allowed task_id and dag_ids:
>
> KEY_REGEX = re.compile(r'^[\w.-]+$')
> GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')
>
> and list of unreserved RFC3986 characters
>
> A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> a b c d e f g h i j k l m n o p q r s t u v w x y z
> 0 1 2 3 4 5 6 7 8 9 - _ . ~
>
> And it seems to be that either _ or ~ would be a better choice for
> separator (especially the ~ I think makes very close resemblance to =)
>
> drwxrwxr-x  3 jarek jarek  4096 Feb 15 18:24  .
> drwxr-xr-x 77 jarek jarek 16384 Feb 13 17:29  ..
> -rw-rw-r--  1 jarek jarek     0 Feb 15 18:16 'a=b'
> -rw-rw-r--  1 jarek jarek     0 Feb 15 18:24  a_b
> -rw-rw-r--  1 jarek jarek     0 Feb 15 18:21  a~b
>
>
> But this is just a suggestion - I am also ok with '=' but we have to
> realise it makes it slightly more annoying for the users.
>
> J.
>
>
>
>
>
>
> On Thu, Feb 10, 2022 at 5:54 PM Ash Berlin-Taylor <[email protected]> wrote:
>
>> Hi all,
>>
>> As part of AIP-42 we need to split out the task instance log files to
>> include the "map index". The simple way of doing this would be just to add
>> another sub-folder, so for instance, we'd have
>>
>>   {{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{
>> try_number }}.log
>>     maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log
>>
>> Which is getting "deep" and hard to know what each component is. So
>> Daniel Standish suggested that maybe we could use the "hive partition
>> style", so like this:
>>
>>   dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{ logical_date
>> }}/map_index={{ map_index }}/attempt={{ try_number }}.log
>>
>> dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log
>>
>> That also made me realise that the "order" of the components is mixed (as
>> task_id is "smaller" than date") so think a better order would be
>>
>>   dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{ task_id
>> }}/map_index={{ map_index }}/attempt={{ try_number }}.log
>>
>> (I haven't shown it here but the actual template only includes map_index
>> for mapped tasks.)
>>
>> For example the logs folder would have these sort of files in them
>> (showing a dag with both a mapped and an normal task)
>>
>> $ tree ~/airflow/logs /home/ash/airflow/logs ├── dag=maptest │   ├──
>> run_id=backfill__2022-02-10T00:00:00+00:00 │   │   ├── task_id=consumer │
>> │   │   ├── map_index=0 │   │   │   │   └── attempt=1.log │   │   │   ├──
>> map_index=1 │   │   │   │   └── attempt=1.log │   │   │   └── map_index=2
>> │   │   │   └── attempt=1.log │   │   └── task_id=make_list │   │   └──
>> attempt=1.log └── scheduler ├── 2022-02-10 └── latest ->
>> /home/ash/airflow/logs/scheduler/2022-02-10
>>
>> We've (well TP) has already handled the historic problem where changing
>> the log template made old logs unviewable in the UI so that isn't an issue
>> anymore, so we can change the template as we like.
>>
>> I'm asking for lazy consensus to change the log_filename_template to:
>> - use run_id instead of logical_date
>> - use key=value hive partition style.
>> - move run_id (in place of date) to before task_id
>>
>> I'm hoping to merge this PR soon (this week) but if we don't get
>> consensus on the above by Tuesday morning European time we can change the
>> template back.
>>
>> I have this done in a draft PR here
>> https://github.com/apache/airflow/pull/21495/files
>>
>> Thanks,
>> Ash
>>
>

Re: Changing default log file template (lazy consensus)

Reply via email to