potiuk commented on PR #32319: URL: https://github.com/apache/airflow/pull/32319#issuecomment-1826902225
> You are converting from one format to another. This is effectively serialization. This is why serde calls to serialize. By removing type information and not including versioning you cannot reinstate the original object. We have absolutely no intention of doing it. This is one way DBAPi -> returned value conversion. We have absolutely no need to add complexity by having (effectively) all DBApi providers to depend on `serde` implementation for that. That adds absolutely unnecessary coupling and serves no purpose from the point of view of standardising DBApi. > The question again is, why is "common format" required? Every DbAPI hook that has its own non-serializable format will need to include its own _make_serializable. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization? Yes there is a very good reason we have it. It's nainly because of open-lineage integration (even if not implemented now then it's the way for it to be used in the fufure). Not everything revolves around serialization in Airflow. For DBApiHook, serialization is just after-thought. which is applied much later in the process. Serialization does not make a "standard" description of the returned data. Serialization looses semantic meaning of the returned data, where DBApIHook returns also meta-data that could be used to make column lineage semantically meaningful. Serialization is a lower-layer constructs that looses semantic meaning of the data returned by the Hook. This was the original intention of implementing common.sql and standardising what Hook returns. It's a building block on which future task-flow implementation of Open Lineage might be built for SQL data processing. > This is what custom serializers do. Currently, the implementation here does not enforce that (Any). Yes. if you think the whole purpose of the interface is to serialize things. In our case we return DBApi-commong form of data that has merely a "property" of being serializable. It's an afterthought. merely a property to make it possible to use the outpit directly as operator output so that it can be stored in Xcom. > I disagree. This is going to be in for a long time, lets make sure we get the design right. I think we have it right the DBApi, it's purpose and reasoning have been discussed there for about 1.5 year and went through several interations. Again `_make_serializable` is just an intrnal implementation details of making sure that whatever gets returned is ALSO serializable. You seem to be giving it much more meaning than it has. > The name is fine although I could argue for normalize or make_namedtuple, if we want to keep the function here. I can see some reasons why, but they are not overly convincing, I would appreciate an argumentation why not in serde's serializers. To avoid unnecessary coupling - if I can think of a single good reason. Our providers have deliberately wide range of supported Airflow versions they should be compatible with. We base a lot of assumptions on it. We cannot assume that we have specific version of Airflow available for us. Using serde and implementing it now in common.sql provider would mean that we will only be able to use it for Airlfow 2.8 that would support it - and possibly even rely on some internal stuff in serde - it woudl requre a set of unit tests testing compatibillity of various provider versions with various airflow version. This is a very bad idea to couple the two when all that we are talking about is to convert "specific" objects returned by Databricks to "serde" module that serializes it, relyiing - possibly - on specific serde version in specific airflow version. That would effectively make serde API and behaviour a Public API of Airflow that ALL DBHook providers should depend on - this coupling is not only unnecessary but also harmful and will slow us down. SERDE is not (and please correct me if I am wrong) not part of Public API of airflow. Are we willing to make 3rd-party providers depend on Serde? What APIs? Are they already public in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html ? Because what you are really asking for is to make 3rd-party providers that would like to implement DBApiHook dpeend on it. I personally think it's a bad idea to add this coupling. Now that I explained why not. I have one question that maybe you can explain Could you please explain what benefit implementing serde for this conversion would bring? I am looking for some good reasons why and, to be honest cannot find it. But since I explained why not, I'd love to hear why. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
