Oh yes -- I'd been tested on ZSH which handles this without me having
to care about it, but bash doesn't.
I thought `_` is valid in dag/task ids right now?
S3 is fine with `=` in the object path, and I don't think URI concern
matters much as I can't think of any cases where a user is going to
generate URLs that contain the `=`.
I can see advantages and dis-advantages to either way. What are other
people's thoughts?
Good point, and changing the template is now easy so we can change this
"at any point" (but ideally before 2.3 is in beta/rc) which is why I
felt happy merging my PR.
-ash
On Tue, Feb 15 2022 at 18:28:02 +0100, Jarek Potiuk <[email protected]>
wrote:
Sorry for the late comment - I just got to that one. I am ok with the
change, I might have one small comment that is worth considering (but
not a blocker).
The "=" sign has special meaning in a number of shells and requires
quoting and when referred to via URL it requires %-encoding - it's
one of the RFC 3986 section 2.2 Reserved Characters
This is not a "huge" issue but it makes it a bit easier for any kind
of manual, debugging, listing etc to use characters that do not
require either.
I looked at allowed task_id and dag_ids:
KEY_REGEX = re.compile(r'^[\w.-]+$')
GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')
and list of unreserved RFC3986 characters
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
And it seems to be that either _ or ~ would be a better choice for
separator (especially the ~ I think makes very close resemblance to =)
drwxrwxr-x 3 jarek jarek 4096 Feb 15 18:24 .
drwxr-xr-x 77 jarek jarek 16384 Feb 13 17:29 ..
-rw-rw-r-- 1 jarek jarek 0 Feb 15 18:16 'a=b'
-rw-rw-r-- 1 jarek jarek 0 Feb 15 18:24 a_b
-rw-rw-r-- 1 jarek jarek 0 Feb 15 18:21 a~b
But this is just a suggestion - I am also ok with '=' but we have to
realise it makes it slightly more annoying for the users.
J.
On Thu, Feb 10, 2022 at 5:54 PM Ash Berlin-Taylor <[email protected]
<mailto:[email protected]>> wrote:
Hi all,
As part of AIP-42 we need to split out the task instance log files
to include the "map index". The simple way of doing this would be
just to add another sub-folder, so for instance, we'd have
{{ dag_id }}/{{ task_id }}/{{ logical_date }}/{{ map_index }}/{{
try_number }}.log
maptest/consumer/2022-02-09T00:00:00+00:00/0/1.log
Which is getting "deep" and hard to know what each component is. So
Daniel Standish suggested that maybe we could use the "hive
partition style", so like this:
dag={{ dag_id }}/task_id={{ task_id }}/logical_date={{
logical_date }}/map_index={{ map_index }}/attempt={{ try_number
}}.log
dag=maptest/task_id=consumer/logical_date=backfill__2022-02-09T00:00:00+00:00/map_index=0/attempt=1.log
That also made me realise that the "order" of the components is
mixed (as task_id is "smaller" than date") so think a better order
would be
dag={{ dag_id }}/logical_date={{ logical_date }}/task_id={{
task_id }}/map_index={{ map_index }}/attempt={{ try_number }}.log
(I haven't shown it here but the actual template only includes
map_index for mapped tasks.)
For example the logs folder would have these sort of files in them
(showing a dag with both a mapped and an normal task)
$ tree ~/airflow/logs
/home/ash/airflow/logs
├── dag=maptest
│ ├── run_id=backfill__2022-02-10T00:00:00+00:00
│ │ ├── task_id=consumer
│ │ │ ├── map_index=0
│ │ │ │ └── attempt=1.log
│ │ │ ├── map_index=1
│ │ │ │ └── attempt=1.log
│ │ │ └── map_index=2
│ │ │ └── attempt=1.log
│ │ └── task_id=make_list
│ │ └── attempt=1.log
└── scheduler
├── 2022-02-10
└── latest -> /home/ash/airflow/logs/scheduler/2022-02-10
We've (well TP) has already handled the historic problem where
changing the log template made old logs unviewable in the UI so that
isn't an issue anymore, so we can change the template as we like.
I'm asking for lazy consensus to change the log_filename_template to:
- use run_id instead of logical_date
- use key=value hive partition style.
- move run_id (in place of date) to before task_id
I'm hoping to merge this PR soon (this week) but if we don't get
consensus on the above by Tuesday morning European time we can
change the template back.
I have this done in a draft PR here
<https://github.com/apache/airflow/pull/21495/files>
Thanks,
Ash