Yeah. No problem with merging. We can indeed change it and maybe even it is good it is merged now with "=" because we might see ourselves how "really" annoying it is. There is nothing like a good experiment. Maybe in a few weeks someone will say "I can't stand this '=' any more - let's change it :)".
One more thought. Technically the separator does not have to be "forbidden" in the id. If the prefixes are fixed \task_id=', 'dag_id=', the 'separator' might be anything we want including . - as well because it is not a separator - it's part of the prefix. It might make it slightly more complex to split it, but it isn't that difficult. I think that is important because we discussed that we want to allow any characters for ids - and technically it does not block us. The only "really" difficult character in this context for ids would be '/'. J. On Tue, Feb 15, 2022 at 7:03 PM Ash Berlin-Taylor <[email protected]> wrote: > Oh yes -- I'd been tested on ZSH which handles this without me having to > care about it, but bash doesn't. > > I thought `_` is valid in dag/task ids right now? > > S3 is fine with `=` in the object path, and I don't think URI concern > matters much as I can't think of any cases where a user is going to > generate URLs that contain the `=`. > > I can see advantages and dis-advantages to either way. What are other > people's thoughts? > > Good point, and changing the template is now easy so we can change this > "at any point" (but ideally before 2.3 is in beta/rc) which is why I felt > happy merging my PR. > > -ash > > On Tue, Feb 15 2022 at 18:28:02 +0100, Jarek Potiuk <[email protected]> > wrote: > > Sorry for the late comment - I just got to that one. I am ok with the > change, I might have one small comment that is worth considering (but not a > blocker). > The "=" sign has special meaning in a number of shells and requires > quoting and when referred to via URL it requires %-encoding - it's one of > the RFC 3986 section 2.2 Reserved Characters > > This is not a "huge" issue but it makes it a bit easier for any kind of > manual, debugging, listing etc to use characters that do not require either. > > I looked at allowed task_id and dag_ids: > > KEY_REGEX = re.compile(r'^[\w.-]+$') > GROUP_KEY_REGEX = re.compile(r'^[\w-]+$') > > and list of unreserved RFC3986 characters > > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z > a b c d e f g h i j k l m n o p q r s t u v w x y z > 0 1 2 3 4 5 6 7 8 9 - _ . ~ > > And it seems to be that either _ or ~ would be a better choice for > separator (especially the ~ I think makes very close resemblance to =) > > drwxrwxr-x 3 jarek jarek 4096 Feb 15 18:24 . > drwxr-xr-x 77 jarek jarek 16384 Feb 13 17:29 .. > -rw-rw-r-- 1 jarek jarek 0 Feb 15 18:16 'a=b' > -rw-rw-r-- 1 jarek jarek 0 Feb 15 18:24 a_b > -rw-rw-r-- 1 jarek jarek 0 Feb 15 18:21 a~b > > > But this is just a suggestion - I am also ok with '=' but we have to > realise it makes it slightly more annoying for the users. > > J. > > > > > > > On Thu, Feb 10, 2022 at 5:54 PM Ash Berlin-Taylor <[email protected]> wrote: > >> Hi all, >> >> As part of AIP-42 we need to split out the task instance log files to >> include the "map index". The simple way of doing this would be just to add >> another sub-folder, so for instance, we'd have >> >> {{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{ >> try_number }}.log >> maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log >> >> Which is getting "deep" and hard to know what each component is. So >> Daniel Standish suggested that maybe we could use the "hive partition >> style", so like this: >> >> dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{ logical_date >> }}/map_index={{ map_index }}/attempt={{ try_number }}.log >> >> dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log >> >> That also made me realise that the "order" of the components is mixed (as >> task_id is "smaller" than date") so think a better order would be >> >> dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{ task_id >> }}/map_index={{ map_index }}/attempt={{ try_number }}.log >> >> (I haven't shown it here but the actual template only includes map_index >> for mapped tasks.) >> >> For example the logs folder would have these sort of files in them >> (showing a dag with both a mapped and an normal task) >> >> $ tree ~/airflow/logs /home/ash/airflow/logs ├── dag=maptest │ ├── >> run_id=backfill__2022-02-10T00:00:00+00:00 │ │ ├── task_id=consumer │ >> │ │ ├── map_index=0 │ │ │ │ └── attempt=1.log │ │ │ ├── >> map_index=1 │ │ │ │ └── attempt=1.log │ │ │ └── map_index=2 >> │ │ │ └── attempt=1.log │ │ └── task_id=make_list │ │ └── >> attempt=1.log └── scheduler ├── 2022-02-10 └── latest -> >> /home/ash/airflow/logs/scheduler/2022-02-10 >> >> We've (well TP) has already handled the historic problem where changing >> the log template made old logs unviewable in the UI so that isn't an issue >> anymore, so we can change the template as we like. >> >> I'm asking for lazy consensus to change the log_filename_template to: >> - use run_id instead of logical_date >> - use key=value hive partition style. >> - move run_id (in place of date) to before task_id >> >> I'm hoping to merge this PR soon (this week) but if we don't get >> consensus on the above by Tuesday morning European time we can change the >> template back. >> >> I have this done in a draft PR here >> https://github.com/apache/airflow/pull/21495/files >> >> Thanks, >> Ash >> >
