bolkedebruin commented on PR #32319:
URL: https://github.com/apache/airflow/pull/32319#issuecomment-1826915604
> > You are converting from one format to another. This is effectively serialization. This is why serde calls it serialization. By removing type information and not including versioning you cannot reinstate the original object.
>
> We have absolutely no intention of doing it. This is one-way DBApi -> returned-value conversion. We have absolutely no need to add complexity by having (effectively) all DBApi providers depend on the `serde` implementation for that. That adds absolutely unnecessary coupling and serves no purpose from the point of view of standardising DBApi.
>
> > The question again is, why is "common format" required? Every DBApi hook that has its own non-serializable format will need to include its own `_make_serializable`. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization?
>
> Yes, there is a very good reason we have it.
>
> It's mainly because of the open-lineage integration (even if not implemented now, it's the way for it to be used in the future). Not everything revolves around serialization in Airflow. For DBApiHook, serialization is just an afterthought, which is applied much later in the process. Serialization does not make a "standard" description of the returned data. Serialization loses the semantic meaning of the returned data, whereas DBApiHook also returns metadata that could be used to make column lineage semantically meaningful.
>
> Serialization is a lower-layer construct that loses the semantic meaning of the data returned by the Hook.

While I think that extra metadata can be beneficial, it is not part of the spec at the moment. The odbc implementation does include it, the databricks hook doesn't. There is, btw, no reason that serialization by the standard framework can't do this.

> This was the original intention of implementing common.sql and standardising what the Hook returns.
> It's a building block on which a future task-flow implementation of Open Lineage might be built for SQL data processing.

Well, I'm happy that you explain this, because I could not get that from the initial issue, the commits here, or the comments (even when re-reading them now). The PR starts with "Thus, I propose to add method in the DBApiHook to make the result of a query serializable". So yes, I read it as "do serialization". Sorry for that.

> > This is what custom serializers do. Currently, the implementation here does not enforce that (Any).
>
> Yes, if you think the whole purpose of the interface is to serialize things. In our case we return a DBApi-common form of data that merely has the "property" of being serializable. It's an afterthought, merely a property that makes it possible to use the output directly as operator output so that it can be stored in XCom.

As mentioned above, the PR indeed read this way.

> > I disagree. This is going to be in for a long time, let's make sure we get the design right.
>
> I think we have it right. The DBApi, its purpose and reasoning have been discussed there for about 1.5 years and went through several iterations. Again, `_make_serializable` is just an internal implementation detail of making sure that whatever gets returned is ALSO serializable. You seem to be giving it much more meaning than it has.

Nitpicking: I don't see any reference to that discussion here. And maybe I was therefore pointed in the wrong direction.

> > The name is fine, although I could argue for normalize or make_namedtuple, if we want to keep the function here. I can see some reasons why, but they are not overly convincing; I would appreciate an argumentation why not in serde's serializers.
>
> To avoid unnecessary coupling - if I can think of a single good reason. Our providers have a deliberately wide range of supported Airflow versions they should be compatible with. We base a lot of assumptions on it.
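(An aside for readers following the thread: the `_make_serializable` conversion being debated essentially maps a driver-specific row onto a plain named tuple. Below is a minimal, hypothetical sketch — the function name, signature, and the simulated DB-API `cursor.description` are illustrative only, not the actual common.sql or provider code.)

```python
from collections import namedtuple
from typing import Any, Sequence


def make_serializable(row: Sequence[Any], description: Sequence[tuple]) -> tuple:
    # DB-API 2.0 cursor.description entries are 7-item sequences whose
    # first element is the column name.
    columns = [col[0] for col in description]
    row_type = namedtuple("Row", columns)  # plain named tuple, no driver types
    return row_type(*row)


# Simulated cursor.description for a two-column result (hypothetical data).
description = [
    ("id", int, None, None, None, None, None),
    ("name", str, None, None, None, None, None),
]
row = make_serializable((1, "airflow"), description)
print(row)  # Row(id=1, name='airflow')
```

The result behaves as a plain tuple, which is the "property of being serializable" mentioned above: it can go straight into XCom without the provider depending on serde.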
> We cannot assume that we have a specific version of Airflow available for us. Using serde and implementing it now in the common.sql provider would mean that we will only be able to use it for Airflow 2.8, which would support it - and possibly even rely on some internal stuff in serde - and it would require a set of unit tests testing compatibility of various provider versions with various Airflow versions. It is a very bad idea to couple the two when all that we are talking about is converting "specific" objects returned by Databricks for the "serde" module that serializes them, relying - possibly - on a specific serde version in a specific Airflow version.

Fair enough. On the other hand we did introduce a

> That would effectively make serde API and behaviour a public API of Airflow that ALL DBHook providers should depend on - this coupling is not only unnecessary but also harmful and will slow us down. Serde is not (and please correct me if I am wrong) part of the public API of Airflow. Are we willing to make 3rd-party providers depend on serde? What APIs? Are they already public in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html ? Because what you are really asking for is to make 3rd-party providers that would like to implement DBApiHook depend on it.

Thanks for the consideration. Good point. You are correct that it is not in the public API, although you could argue that the format of serialization is, and, by definition, being able to serialize/deserialize also. It is not the best argument here, I agree.

> Could you please explain what benefit implementing serde for this conversion would bring? I am looking for some good reasons why and, to be honest, cannot find them. But since I explained why not, I'd love to hear why.

While with the above points I am convinced the implementation is okay, for the sake of completeness and assuming no need for an intermediate format (lineage):

1. Standardized architecture for serialization / deserialization.
Tried and tested by now.
2. Forward compatible. To me the 'common format' currently still reads as not forward compatible when we deem that the format needs to change.
3. Available for serialization by others. If something now returns a `pyodbc.Row` into XCom for some reason, they would need to implement their own serializer again, and possibly the results cannot be interchanged.
4. Arguably simpler. This is what the serialization/deserialization would have looked like for a `pyodbc.Row` in the framework:

```python
# Preamble in the style of Airflow's serializer modules (added for completeness).
from __future__ import annotations

from typing import Any, NamedTuple, cast

from airflow.serialization.serde import U
from airflow.utils.module_loading import qualname

__version__ = 1


def serialize(o: object) -> tuple[U, str, int, bool]:
    import pyodbc

    o = cast(pyodbc.Row, o)
    columns: list[tuple[str, type]] = [col[:2] for col in o.cursor_description]
    row_object = NamedTuple("Row", columns)  # type: ignore[misc]
    row = row_object(*o)
    return row, qualname(o), __version__, True


def deserialize(classname: str, version: int, data: object) -> Any:
    import pyodbc

    if version > __version__:
        raise TypeError("serialized version is newer than class version")

    if classname == qualname(pyodbc.Row):
        return data

    raise TypeError(f"do not know how to deserialize {classname}")
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
