bolkedebruin commented on PR #32319:
URL: https://github.com/apache/airflow/pull/32319#issuecomment-1826915604
> > You are converting from one format to another. This is effectively serialization. This is why serde calls it serialization. By removing type information and not including versioning you cannot reinstate the original object.
>
> We have absolutely no intention of doing it. This is one-way DBApi -> returned-value conversion. We have absolutely no need to add complexity by having (effectively) all DBApi providers depend on the `serde` implementation for that. That adds absolutely unnecessary coupling and serves no purpose from the point of view of standardising DBApi.
>
> > The question again is, why is "common format" required? Every DBApi hook that has its own non-serializable format will need to include its own `_make_serializable`. This is what the serde serializers were invented for. Why is the common format required? What does it solve outside of serialization?
>
> Yes, there is a very good reason we have it.
>
> It's mainly because of the open-lineage integration (even if not implemented now, it's the way for it to be used in the future). Not everything revolves around serialization in Airflow. For DBApiHook, serialization is just an afterthought, which is applied much later in the process. Serialization does not make a "standard" description of the returned data. Serialization loses the semantic meaning of the returned data, whereas DBApiHook also returns metadata that could be used to make column lineage semantically meaningful.
>
> Serialization is a lower-layer construct that loses the semantic meaning of the data returned by the Hook.

While I think that extra metadata can be beneficial, it is not part of the spec at the moment. The odbc implementation does include it, the databricks hook doesn't. There is, btw, no reason that serialization by the standard framework can't do this.

> This was the original intention of implementing common.sql and standardising what the Hook returns.
> It's a building block on which a future task-flow implementation of Open Lineage might be built for SQL data processing.

Well, I'm happy that you explain this, because I could not get that from the initial issue, the commits here, or the comments (even when re-reading them now). The PR starts with "Thus, I propose to add method in the DBApiHook to make the result of a query serializable". So yes, I read it as "do serialization". Sorry for that.

> > This is what custom serializers do. Currently, the implementation here does not enforce that (Any).
>
> Yes, if you think the whole purpose of the interface is to serialize things. In our case we return a DBApi-common form of data that merely has the "property" of being serializable. It's an afterthought, merely a property that makes it possible to use the output directly as operator output so that it can be stored in XCom.

As mentioned above, the PR indeed read this way.

> > I disagree. This is going to be in for a long time, let's make sure we get the design right.
>
> I think we have it right. The DBApi, its purpose and reasoning have been discussed there for about 1.5 years and went through several iterations. Again, `_make_serializable` is just an internal implementation detail of making sure that whatever gets returned is ALSO serializable. You seem to be giving it much more meaning than it has.

Nitpicking: I don't see any reference to that discussion here. And maybe I was therefore pointed in the wrong direction.

> > The name is fine, although I could argue for normalize or make_namedtuple, if we want to keep the function here. I can see some reasons why, but they are not overly convincing; I would appreciate an argumentation why not in serde's serializers.
>
> To avoid unnecessary coupling - if I can think of a single good reason. Our providers have a deliberately wide range of supported Airflow versions they should be compatible with. We base a lot of assumptions on it.
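(An aside for readers following the thread: the `_make_serializable` conversion being debated essentially maps a driver-specific row onto a plain named tuple. Below is a minimal, hypothetical sketch — the function name, signature, and the simulated DB-API `cursor.description` are illustrative only, not the actual common.sql or provider code.)

```python
from collections import namedtuple
from typing import Any, Sequence


def make_serializable(row: Sequence[Any], description: Sequence[tuple]) -> tuple:
    # DB-API 2.0 cursor.description entries are 7-item sequences whose
    # first element is the column name.
    columns = [col[0] for col in description]
    row_type = namedtuple("Row", columns)  # plain named tuple, no driver types
    return row_type(*row)


# Simulated cursor.description for a two-column result (hypothetical data).
description = [
    ("id", int, None, None, None, None, None),
    ("name", str, None, None, None, None, None),
]
row = make_serializable((1, "airflow"), description)
print(row)  # Row(id=1, name='airflow')
```

The result behaves as a plain tuple, which is the "property of being serializable" mentioned above: it can go straight into XCom without the provider depending on serde.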
> We cannot assume that we have a specific version of Airflow available for us. Using serde and implementing it now in the common.sql provider would mean that we will only be able to use it for Airflow 2.8, which would support it - and possibly even rely on some internal stuff in serde - and it would require a set of unit tests testing compatibility of various provider versions with various Airflow versions. It is a very bad idea to couple the two when all that we are talking about is converting "specific" objects returned by Databricks for the "serde" module that serializes them, relying - possibly - on a specific serde version in a specific Airflow version.

Fair enough. On the other hand we did introduce a

> That would effectively make serde API and behaviour a public API of Airflow that ALL DBHook providers should depend on - this coupling is not only unnecessary but also harmful and will slow us down. Serde is not (and please correct me if I am wrong) part of the public API of Airflow. Are we willing to make 3rd-party providers depend on serde? What APIs? Are they already public in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html ? Because what you are really asking for is to make 3rd-party providers that would like to implement DBApiHook depend on it.

Thanks for the consideration. Good point. You are correct that it is not in the public API, although you could argue that the format of serialization is, and, by definition, being able to serialize/deserialize also. It is not the best argument here, I agree.

> Could you please explain what benefit implementing serde for this conversion would bring? I am looking for some good reasons why and, to be honest, cannot find them. But since I explained why not, I'd love to hear why.

While with the above points I am convinced the implementation is okay, for the sake of completeness and assuming no need for an intermediate format (lineage):

1. Standardized architecture for serialization / deserialization.
Tried and tested by now.
2. Forward compatible. To me the 'common format' currently still reads as not forward compatible when we deem that the format needs to change.
3. Available for serialization by others. If something now returns a `pyodbc.Row` into XCom for some reason, they would need to implement their own serializer again, and possibly the results cannot be interchanged.
4. Arguably simpler. This is what the serialization/deserialization would have looked like for a `pyodbc.Row` in the framework:

```python
# Preamble in the style of Airflow's serializer modules (added for completeness).
from __future__ import annotations

from typing import Any, NamedTuple, cast

from airflow.serialization.serde import U
from airflow.utils.module_loading import qualname

__version__ = 1


def serialize(o: object) -> tuple[U, str, int, bool]:
    import pyodbc

    o = cast(pyodbc.Row, o)
    columns: list[tuple[str, type]] = [col[:2] for col in o.cursor_description]
    row_object = NamedTuple("Row", columns)  # type: ignore[misc]
    row = row_object(*o)
    return row, qualname(o), __version__, True


def deserialize(classname: str, version: int, data: object) -> Any:
    import pyodbc

    if version > __version__:
        raise TypeError("serialized version is newer than class version")

    if classname == qualname(pyodbc.Row):
        return data

    raise TypeError(f"do not know how to deserialize {classname}")
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
