r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-786041115
>Yeah. I was expecting exactly this answer :). So summarizing - what you've done now, you created hybrid dataclass and dictionary put together. Now, my goal is to show you what further consequences you have to deal with. [0_O](https://www.youtube.com/watch?v=4F4qzPbcFiA) >There are hundreds of thousands custom operators our there that are using this "public API". This means that any change we introduce now is going to be around at least a year from now. And until then 1.10 is still there as well (and will be there for quite some time). So if someone develops custom operators for their 1.10 Airflow, they will still use dictionary - so we have probably tens of thousands custom operators created still using the 'Dictionary' context for another year or two. Yes, this is true. Thats why I mentioned that now is the "easiest" time for this change as it only gets harder in the future; (In a perfect world, this would have been raised before 2.0 :s). > Most likely many of the currently released operators are using the context to pass data (as custom dictionary values) between those methods - one can set a custom value in pre_execute() and retrieve it in execute() or post_execute() reads whatever execute sets in the context. It was easy to use, we have not forbidden it, it is part of the API (this is the basic "property" of dictionary - unlike dataclass - that you can set any value with any key there). By introducing Dataclass we are breaking this property. You will not be able to set arbitrary key in the context in pre_execute so that it is available in execute. If we implement the interrim (lasting at least a year or more) hybrid dataclass <-> dictionary proposed above, this will continue to work but with deprecation warnings. Again dataclass does not break this property and in fact after thinking, I think dataclasses provides the following significant advantages in this area which dict and typedDict do not. 1. A safer way for users to set custom fields in context 1. A safer way for airflow maintainers to add new fields to context Take the following example <details><summary>Context Dataclass MVP V2</summary> ```python @dataclass class Demo: # context replacement id: str value_dc: int user_defined: Dict[str, Any] = field(default_factory=dict) def __getitem__(self, item): if item in self.__dict__.keys(): logging.warning(msg=f"dictionary interface getitem on context is deprecated; update to use the dataclass interface for standard fields like `{item}`") return self.__dict__[item] elif item in self.user_defined: logging.warning(msg=f"dictionary interface getitem on context is deprecated; update to use context.user_defined for custom fields like `{item}`") return self.user_defined[item] else: raise KeyError def __setitem__(self, key: str, value): if key in self.__dict__.keys(): msg = f"""dictionary interface setitem for standard fields is deprecated; update to use the dataclass interface for standard fields like `{key}` note: changing standard context fields is not supported and may have undefined behavior. If this is meant to be a custom field use context.user_defined instead""" logging.warning(msg=msg) self.__dict__[key] = value else: logging.warning( msg=f"dictionary interface setitem on context is deprecated; update to use context.user_defined for custom fields like `{key}`") self.user_defined[key] = value def keys(self): # added as an example to show how far we could go to have a non-breaking change for 2.1 logging.warning(msg=f"dictionary interface keys is deprecated; update this to use the dataclass interface") temp = self.__dict__ temp.update(self.user_defined) return temp d = Demo(id="long_id", value_dc=1337) print(d["id"]) d["new"] = 3 print(d["new"]) print(d.keys()) d["id"] = "warn" ``` returns ``` WARNING:root:dictionary interface getitem on context is deprecated; update to use the dataclass interface for standard fields like `id` WARNING:root:dictionary interface setitem on context is deprecated; update to use context.user_defined for custom fields like `new` WARNING:root:dictionary interface getitem on context is deprecated; update to use context.user_defined for custom fields like `new` WARNING:root:dictionary interface keys is deprecated; update this to use the dataclass interface WARNING:root:dictionary interface setitem for standard fields is deprecated; update to use the dataclass interface for standard fields like `id` note: changing standard context fields is not supported and may have undefined behavior. If this is meant to be a custom field use context.user_defined instead long_id 3 {'id': 'long_id', 'value_dc': 1337, 'user_defined': {'new': 3}, 'new': 3} ``` </details> By using the dataclass Airflow is now able to do several things 1. Provide a safe place for users to store user_defined context attributes without risk of clobbering current keys. * what happens if we need to add a key to context in the future that some of these 1000's of operators are now using as a custom field? (I believe they'd have broken code until they tracked down the silent error due to the context key re-write) 1. Provide clear and helpful deprecation and warning notices for bugs that people may have today and not realize * i.e. what happens if today someone is overwrite a key today without realizing it * More word smithing would probably be needed on the sample warnings above 1. We can let users subclass context if they want as an alternative to using `user_defined` dict to get strong typing. 1. We are able to provide a non-breaking change implementation for 2.1 >So my challenge for you is this - propose a strategy that will help to migrate all those cases by our users - in the way that will not make them refrain from migration to 3.0.0 with the Dataclass context. What checks are you going to use, how are you going to encourage the users to migrate all the thousands of DAGs they have, what tools you are going to provide them. My strategy is to add this deprecation warning as soon as possible i.e. 2.1 and give the users until 3.0 to update to the new interface. This will reduce the number of operators that get created with the old dictionary interface and give users the most time to upgrade. Additionally a quick 1 pager upgrade guide could be written to show the change in interface. > And once you do that, I will ask you - and other committers (undoubtedly looking at that conversation) - to answer single question: `Is it worth the hassle if we can achieve the very same user experience by using TypedDict ?` The thing is you don't get the same user experience by using TypedDict. By using dataclasses you get the following additional benefits. 1. Better IDE support (find usages/ refactoring / name highlighting) for users and mainters (see above for image example) 1. A safer interface for users who want to add custom fields to context 1. A safer way for maintainers to add new fields to context 1. A clean way to implement deprecation warnings with detailed warning messages about potential silent bugs 1. More flexibility down the road (dataclasses are more flexible than dictionaries) 1. A solution that is easier to maintain in the future * I concede that right now and for the rest of 2.x it will be more complicated to maintain. However thinking about the long-term future of airflow I think once Airflow 3+ rolls around we will be happy to have made this change due the reasons outlined above. * Or said another way we shouldn't optimize for airflow 2.x maintainability we should optimize for airflow maintainability. p.s. sorry for another wall of text. I guess it turns out that I'm a little passionate on this one... ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
