chsanjeev commented on issue #59795:
URL: https://github.com/apache/airflow/issues/59795#issuecomment-3757513954
@o-nikolas happy new year, and thanks for looking into it.
I saw your presentation on the AWS Lambda executor; thank you for your
contributions. I see a lot of potential for executing short-lived tasks at
scale without overwhelming Airflow. I tested it out and was blown away by the
scale.
Coming to this issue:
I thought about two options when I posted it.
1. Increase the length of the column
Approach:
My thought is to make it a TEXT field, or raise the limit to something like
10,000 characters, for two reasons:
- I don't see this column used in any joins or as a
referential/primary key, so making it TEXT should have no impact on metadata
query performance.
- Even if we limit the conversation to the AWS Lambda executor, there is a
high chance that the external ID set by the executor will be longer than 1,000
characters.
Pros:
- Provides better visibility into the task details in, as you said, a
human-readable format.
- Audit and observability: given that external executors like Lambda
live only for the duration of the task, a human-readable format will let people
build custom reporting on the task metadata if needed.
Cons:
- This will increase the size of the database if people use only
external executors (like AWS Lambda, which scales very high, at least from
what I have noticed).
- In the future, this approach could cause significant performance issues
if the community decides to use the external ID column as a referential column
or an alternate key, or to join two tables.
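To put a rough number on the first con, here is a back-of-the-envelope sketch.
It assumes storage grows with the bytes actually stored rather than with the
declared column limit (in PostgreSQL, for example, varchar and text are stored
the same way). The function name and all workload figures are hypothetical and
only there for illustration.

```python
def extra_storage_bytes(task_instances: int,
                        avg_id_bytes_before: int,
                        avg_id_bytes_after: int) -> int:
    """Rough extra bytes needed when the average stored external ID grows.

    Ignores indexes, row headers, and compression, so treat it as an
    order-of-magnitude estimate only.
    """
    growth_per_row = max(avg_id_bytes_after - avg_id_bytes_before, 0)
    return task_instances * growth_per_row


# Hypothetical workload: 10 million task instances whose external IDs grow
# from about 50 bytes to about 1,200 bytes each.
extra = extra_storage_bytes(10_000_000, 50, 1_200)
print(f"approx. {extra / 1024**3:.1f} GiB of additional ID data")
```

The estimate scales linearly with the number of retained task instances, so
the metadata cleanup policy matters as much as the column type here.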
2. Hash or UUID approach:
Approach:
Store a hash of the external ID text, or even a UUID, instead of the raw
value.
Pros:
- No changes to existing metadata database.
- No additional storage footprint.
- Allows the column to be used in join conditions or as a key on the table.
Cons:
- This will force every external executor provider to create a hash or
UUID with an agreed approach (SHA/UUID) for consistency across the
ecosystem.
- We lose the human readability and the ad hoc reporting capability.
- Hash collision concerns on heavily used platforms.
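As a minimal sketch of what option 2 could look like, the snippet below derives
a fixed-length key from an arbitrarily long external ID, once with SHA-256 and
once with a name-based UUID (uuid5). It also gives a birthday-bound estimate
for the overlap (collision) concern. The helper names and the namespace string
are assumptions made up for this example, not an existing Airflow API; every
provider would need to agree on the same scheme.

```python
import hashlib
import math
import uuid

# Hypothetical shared namespace; any fixed UUID agreed on by all providers works.
EXTERNAL_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example:external-executor-id")


def sha256_key(external_id: str) -> str:
    """Return a 64-character hex digest regardless of the input length."""
    return hashlib.sha256(external_id.encode("utf-8")).hexdigest()


def uuid5_key(external_id: str) -> str:
    """Return a deterministic 36-character UUID string for the given ID."""
    return str(uuid.uuid5(EXTERNAL_ID_NAMESPACE, external_id))


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2**(bits+1))."""
    return -math.expm1(-n_ids * (n_ids - 1) / 2.0 ** (bits + 1))


# A made-up, very long external ID standing in for what an executor might set.
long_id = "lambda-invocation/" + "x" * 5000
print(sha256_key(long_id))
print(uuid5_key(long_id))

# uuid5 leaves 122 bits for the hash after the version and variant bits.
print(collision_probability(1_000_000_000, 122))
```

Under this approximation, even a billion distinct IDs leave the collision
probability for a 122-bit key vanishingly small, so the bigger costs of option
2 are probably the lost readability and the cross-provider coordination.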
My vote is to go with “Increase the length of the column”. But I am open to
option 2 as well.
Thanks