Sounds like this may be a good candidate for an Airflow Improvement Plan...

On 20 September 2018 at 22:45, Brian Greene
<> wrote:
> Prior to using airflow for much, on first inspection, I think I may have 
> agreed with you.
> After a bit of use I’d agree with Fokko and others - this isn’t really a 
> problem, and separating them seems to do more harm than good related to 
> deployment.
> I was gonna stop there, but why?
> You can add a task to a dag that’s deployed and has run and still view 
> history.  The “new” task shows up white Squares in the old dags.  nobody said 
> you’re required to also rename the dag when you do so this.  If your process 
> or desire or design determines you need to rename it, well then by 
> definition... isn’t it a new thing without a history?  Airflow is 
> implementing exactly that.
> One could argue that renaming to reflect exact purpose is good practice.  
> Yes, I’d agree, but again following that logic if it’s a small enough change 
> to “slip in” then the name likely shouldn’t change.  If it’s big enough I 
> want to change the name then it’s a big enough change that I’m functionally 
> running something “new”, and I expect to need to account for that.  Airflow 
> is enforcing that logic by coupling the name to the deployment of what you 
> said was a new process.
> One might put forth that changing the name to be more descriptive In the ui 
> makes it easier for support staff.  I think perhaps if that’s your challenge 
> it’s not airflow that’s a problem.  Dags are of course documented elsewhere 
> besides their name, right?  Yeah it’s self documenting (and the graphs are 
> cool), but I have to assume there’s something besides the NAME to tell people 
> what it does.  Additionally, far more than the name is required for even an 
> operator or monitor watcher to take action - you don’t expect them to know 
> which tasks to rerun or how to troubleshoot failures just based on your “now 
> most descriptive name in the UI” do you?
> I spent time In an informatica shop where all the jobs were numbered.  
> Numbered.  Let’s be more exact... their NAMES were NUMBERS like 56709. 
> Terrible, but 100% worked, because while a descriptive name would have been 
> useful, the name is the thing that’s supposed to NOT CHANGE (see code of 
> Abibarshim), and all the other information can attach to that in places where 
> you write... other information.  People would curse a number “F’ing 6291 
> failed again” - everyone knew what they were talking about.. I digress.
>  You might decide to document “dag ID 12” or just “12” on your wiki - I’m  
> going to document “daily_sales_import”.  And when things start failing at 3am 
> it’s not my dag “56” that’s failing, it’s the sales_export dag.  But if you 
> document “12”, that’s still it’s name, and it’d better be 12 in all your 
> environments and documents.  This also means the actual db IDs from your 
> proposal are almost certainly NOT the same across your environments, making 
> the 12 unchangeable name!
> There are lots of languages (most of them) where the name of a thing is 
> important and hard to change.  It’s not a bad thing, and I’d assume that 
> deploying a thing by name has some significance in many systems.  Go rename a 
> class in... pick a language... tell me how that should be easier to do 
> willy-nilly so it’s easier In the UI.
> I suppose you could view it as a limitation, But i don’t think you’ve 
> illuminated a single use case where it’s an actual technical constraint or 
> limitation.
> The BEST argument against the current implementation is db performance.  It’s 
> a hogwash argument.  Basic key indexes on low cardinality string columns are 
> plenty fast for the airflow workload, and if your task load is so high 
> airflow can’t keep up or your seeing super-fast tasks and airflow db/tracking 
> latency is too much... perhaps a messaging or queue processing solution is 
> better suited to those workloads.  We see scheduler bottlenecks long before 
> the database for our “quick task” scenarios.  Additionally, reading through 
> this list you’ll find people running airflow at substantial scale - I’ve not 
> seen anyone complaining of production performance issues based on this design 
> decision.   At first I hated it.  String keys are dirty, we’re all taught 
> that as good little programmers.  Except when performance won’t be a huge 
> consideration since it’s not OLTP and easy of queryabilty is more important 
> because it’s a growing system... good decision - whoever made it.
> How does filename matter?  Frankly I wish the filename was REQUIRED to be the 
> dag name so people would quit confusing themselves by mismatching them !   
> We’ve renamed dag files with no issue as long as the content doesn’t change, 
> so again, not a real use case.  And really - name your stuff careful before 
> you get to prod man.
> I gotta ask - airflowuser - are you gonna use airflow for anything, or just 
> poke it with a stick from a distance and ask semi-inane questions of these 
> fine folks that wrote and spend time working on this cool piece of kit?
> B
> Sent from a device with less than stellar autocorrect
>> On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko <> wrote:
>> I like the dag_id for both the name and as an unique identifier. If you
>> change the dag in such a way, that it deserves a new name, you probably
>> want to create a new dag anyway. If you want to give some additional
>> context, you can use the description field:
>> The name of the file of dag does not have any influence.
>> My 2¢
>> Cheers, Fokko
>> Op do 20 sep. 2018 om 19:40 schreef James Meickle
>> <>:
>>> I'm personally against having some kind of auto-increment numeric ID for
>>> DAGs. While this makes a lot of sense for systems where creation is a
>>> database activity (like a POST request), in Airflow, DAG creation is
>>> actually a code ship activity. There are all kinds of complex scenarios
>>> around that:
>>> - I revert a commit and a DAG disappears or is renamed
>>> - I run the same file, twice, with multiple parameters to create two DAGs
>>> - I create the DAG in both staging and prod, but they wind up with
>>> different IDs
>>> It's just too hard to automatically track these scenarios.
>>> If we really wanted to put something like this in place, it would first
>>> make more sense to decouple DAG creation from code shipping, and instead
>>> prefer creation of a DAG outside of code (but with a definition that
>>> references which git repo/committish/file/arguments/etc. to use). Then if
>>> you do something like rename a file, the DAG breaks, but at least still
>>> exists in the db with that ID and history still makes sense once you update
>>> the DAG definition with the new code location.
>>> On Thu, Sep 20, 2018 at 4:52 AM airflowuser
>>> <> wrote:
>>>> Hi,
>>>> though this could have been explained on Jira I think this should be
>>>> discussed first.
>>>> The problem:
>>>> Airflow mixes DAG name with id. It uses same filed for both purposes.
>>>> I assume that most of you use the dag_id to describe what the DAG
>>> actually
>>>> does.
>>>> For example:
>>>> dag = DAG(
>>>>    dag_id='cost_report_daily',
>>>> ...
>>>> )
>>>> This dag_id is reflected to the dag id column in the UI.
>>>> Now, lets say that you want to add another task to this specific dag -
>>> You
>>>> are to be extremely careful when you change the dag_id to represent the
>>> new
>>>> functionality for example : dag_id='cost_expenses_reports_daily' . This
>>>> will break the history of the DAG.
>>>> Or even with simpler use case.. the user just want to change the name he
>>>> sees on the UI.
>>>> I suggest to have a discussion if the dag_id should be split into id (an
>>>> actual id) and name to reflect what it does. When the "connection" is
>>> done
>>>> by id's  - names can change as much as you want without breaking
>>> anything.
>>>> essentially it becomes a field uses for display purpose  only.
>>>> * I didn't mention also the issue of DAG file name which can also cause
>>>> trouble if someone wants to change it.
>>>> Sent with [ProtonMail]( Secure Email.

Reply via email to