r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-786041115


   >Yeah. I was expecting exactly this answer :). So summarizing - what you've 
done now, you created hybrid dataclass and dictionary put together. Now, my 
goal is to show you what further consequences you have to deal with.
   
   [0_O](https://www.youtube.com/watch?v=4F4qzPbcFiA)
   
   >There are hundreds of thousands custom operators our there that are using 
this "public API". This means that any change we introduce now is going to be 
around at least a year from now. And until then 1.10 is still there as well 
(and will be there for quite some time). So if someone develops custom 
operators for their 1.10 Airflow, they will still use dictionary - so we have 
probably tens of thousands custom operators created still using the 
'Dictionary' context for another year or two.
   
   Yes, this is true. Thats why I mentioned that now is the "easiest" time for 
this change as it only gets harder in the future; (In a perfect world, this 
would have been raised before 2.0 :s).
   
   > Most likely many of the currently released operators are using the context 
to pass data (as custom dictionary values) between those methods - one can set 
a custom value in pre_execute() and retrieve it in execute() or post_execute() 
reads whatever execute sets in the context. It was easy to use, we have not 
forbidden it, it is part of the API (this is the basic "property" of dictionary 
- unlike dataclass - that you can set any value with any key there). By 
introducing Dataclass we are breaking this property. You will not be able to 
set arbitrary key in the context in pre_execute so that it is available in 
execute. If we implement the interrim (lasting at least a year or more) hybrid 
dataclass <-> dictionary proposed above, this will continue to work but with 
deprecation warnings.
   
   Again dataclass does not break this property and in fact after thinking, I 
think dataclasses provides the following significant advantages in this area 
which dict and typedDict do not.
   1. A safer way for users to set custom fields in context
   1. A safer way for airflow maintainers to add new fields to context
   
   Take the following example
   
   <details><summary>Context Dataclass MVP V2</summary>
   
   ```python
   
   @dataclass
   class Demo:  # context replacement
       id: str
       value_dc: int
       user_defined: Dict[str, Any] = field(default_factory=dict)
   
       def __getitem__(self, item):
           if item in self.__dict__.keys():
               logging.warning(msg=f"dictionary interface getitem on context is 
deprecated; update to use the dataclass interface for standard fields like 
`{item}`")
               return self.__dict__[item]
           elif item in self.user_defined:
               logging.warning(msg=f"dictionary interface getitem on context is 
deprecated; update to use context.user_defined for custom fields like `{item}`")
               return self.user_defined[item]
           else:
               raise KeyError
   
       def __setitem__(self, key: str, value):
           if key in self.__dict__.keys():
               msg = f"""dictionary interface setitem for standard fields is 
deprecated; update to use the dataclass interface for standard fields like 
`{key}`
               note: changing standard context fields is not supported and may 
have undefined behavior. If this is meant to be a custom field use 
context.user_defined instead"""
               logging.warning(msg=msg)
               self.__dict__[key] = value
           else:
               logging.warning(
                   msg=f"dictionary interface setitem on context is deprecated; 
update to use context.user_defined for custom fields like `{key}`")
               self.user_defined[key] = value
   
       def keys(self):
           # added as an example to show how far we could go to have a 
non-breaking change for 2.1
           logging.warning(msg=f"dictionary interface keys is deprecated; 
update this to use the dataclass interface")
           temp = self.__dict__
           temp.update(self.user_defined)
           return temp
   
   
   d = Demo(id="long_id", value_dc=1337)
   print(d["id"])
   d["new"] = 3
   print(d["new"])
   print(d.keys())
   d["id"] = "warn"
   ```
   returns
   ```
   WARNING:root:dictionary interface getitem on context is deprecated; update 
to use the dataclass interface for standard fields like `id`
   WARNING:root:dictionary interface setitem on context is deprecated; update 
to use context.user_defined for custom fields like `new`
   WARNING:root:dictionary interface getitem on context is deprecated; update 
to use context.user_defined for custom fields like `new`
   WARNING:root:dictionary interface keys is deprecated; update this to use the 
dataclass interface
   WARNING:root:dictionary interface setitem for standard fields is deprecated; 
update to use the dataclass interface for standard fields like `id`
               note: changing standard context fields is not supported and may 
have undefined behavior. If this is meant to be a custom field use 
context.user_defined instead
   long_id
   3
   {'id': 'long_id', 'value_dc': 1337, 'user_defined': {'new': 3}, 'new': 3}
   ```
   
   </details>
   
   By using the dataclass Airflow is now able to do several things
   1. Provide a safe place for users to store user_defined context attributes 
without risk of clobbering current keys.
        * what happens if we need to add a key to context in the future that 
some of these 1000's of operators are now using as a custom field? (I believe 
they'd have broken code until they tracked down the silent error due to the 
context key re-write)
   1. Provide clear and helpful deprecation and warning notices for bugs that 
people may have today and not realize
       * i.e. what happens if today someone is overwrite a key today without 
realizing it
       * More word smithing would probably be needed on the sample warnings 
above
   1. We can let users subclass context if they want as an alternative to using 
`user_defined` dict to get strong typing.
   1. We are able to provide a non-breaking change implementation for 2.1 
   
   >So my challenge for you is this - propose a strategy that will help to 
migrate all those cases by our users - in the way that will not make them 
refrain from migration to 3.0.0 with the Dataclass context. What checks are you 
going to use, how are you going to encourage the users to migrate all the 
thousands of DAGs they have, what tools you are going to provide them.
   
   My strategy is to add this deprecation warning as soon as possible i.e. 2.1 
and give the users until 3.0 to update to the new interface. This will reduce 
the number of operators that get created with the old dictionary interface and 
give users the most time to upgrade. Additionally a quick 1 pager upgrade guide 
could be written to show the change in interface.
   
   > And once you do that, I will ask you - and other committers (undoubtedly 
looking at that conversation) - to answer single question:
   `Is it worth the hassle if we can achieve the very same user experience by 
using TypedDict ?`
   
   The thing is you don't get the same user experience by using TypedDict. By 
using dataclasses you get the following additional benefits.
   1. Better IDE support (find usages/ refactoring / name highlighting) for 
users and mainters (see above for image example)
   1. A safer interface for users who want to add custom fields to context
   1. A safer way for maintainers to add new fields to context
   1. A clean way to implement deprecation warnings with detailed warning 
messages about potential silent bugs
   1. More flexibility down the road (dataclasses are more flexible than 
dictionaries)
   1. A solution that is easier to maintain in the future
       * I concede that right now and for the rest of 2.x it will be more 
complicated to maintain. However thinking about the long-term future of airflow 
I think once Airflow 3+ rolls around we will be happy to have made this change 
due the reasons outlined above.
       * Or said another way we shouldn't optimize for airflow 2.x 
maintainability we should optimize for airflow maintainability.
       
   p.s. sorry for another wall of text. I guess it turns out that I'm a little 
passionate on this one... 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to