potiuk commented on PR #32319:
URL: https://github.com/apache/airflow/pull/32319#issuecomment-1826902225

   > You are converting from one format to another. This is effectively 
serialization. This is why serde calls to serialize. By removing type 
information and not including versioning you cannot reinstate the original 
object.
   
   We have absolutely no intention of doing it. This is one way DBAPi -> 
returned value conversion. We have absolutely no need to add complexity by 
having (effectively) all DBApi providers to depend on `serde` implementation 
for that. That adds absolutely unnecessary coupling and serves no purpose from 
the point of view of standardising DBApi.
   
   > The question again is, why is "common format" required? Every DbAPI hook 
that has its own non-serializable format will need to include its own 
_make_serializable. This is what the serde serializers were invented for. Why 
is the common format required? What does it solve outside of serialization?
   
   Yes there is a very good reason we have it.
   
   It's nainly because of open-lineage integration (even if not implemented now 
then it's the way for it to be used in the fufure). Not everything revolves 
around serialization in Airflow. For DBApiHook, serialization is just 
after-thought. which is applied much later in the process. Serialization does 
not make a "standard" description of the returned data. Serialization looses 
semantic meaning of the returned data, where DBApIHook returns also meta-data 
that could be used to make column lineage semantically meaningful. 
Serialization is a lower-layer constructs that looses semantic meaning of the 
data returned by the Hook. This was the original intention of implementing 
common.sql and standardising what Hook returns. It's a building block on which 
future task-flow implementation of Open Lineage might be built for SQL data 
processing.
   
   > This is what custom serializers do. Currently, the implementation here 
does not enforce that (Any).
   
   Yes. if you think the whole purpose of the interface is to serialize things. 
In our case we return DBApi-commong form of data that has merely a "property" 
of being serializable. It's an afterthought. merely a property to make it 
possible to use the outpit directly as operator output so that it can be stored 
in Xcom.
   
   > I disagree. This is going to be in for a long time, lets make sure we get 
the design right.
   
   I think we have it right the DBApi, it's purpose and reasoning have been 
discussed there for about 1.5 year and  went through several interations. Again 
`_make_serializable` is just an intrnal implementation details of making sure 
that whatever gets returned is ALSO serializable. You seem to be giving it much 
more meaning than it has.
   
   >  The name is fine although I could argue for normalize or make_namedtuple, 
if we want to keep the function here. I can see some reasons why, but they are 
not overly convincing, I would appreciate an argumentation why not in serde's 
serializers.
   
   To avoid unnecessary coupling - if I can think of a single good reason. Our 
providers have deliberately wide range of supported Airflow versions they 
should be compatible with. We base a lot of assumptions on it. We cannot assume 
that we have specific version of Airflow available for us. Using serde and 
implementing it now in common.sql provider would mean that we will only be able 
to use it for Airlfow 2.8 that would support it - and possibly even rely on 
some internal stuff in serde - it woudl requre a set of unit tests testing 
compatibillity of various provider versions with various airflow version. This 
is a very bad idea to couple the two when all that we are talking about is to 
convert "specific" objects returned by Databricks to "serde" module that 
serializes it, relyiing - possibly - on specific serde version in specific 
airflow version.
   
   That would effectively make serde API and behaviour a Public API of Airflow 
that ALL DBHook providers should depend on - this coupling is not only 
unnecessary but also harmful and will slow us down. SERDE is not (and please 
correct me if I am wrong) not part of Public API of airflow. Are we willing to 
make 3rd-party providers depend on Serde? What APIs? Are they already public in 
https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html
 ? Because what you are really asking for is to make 3rd-party providers that 
would like to implement DBApiHook dpeend on it.
   
   I personally think it's a bad idea to add this coupling.
   
   Now that I explained why not. I have one question that maybe you can explain
   
   Could you please explain what benefit implementing serde for this conversion 
would bring? I am looking for some good reasons why and, to be honest cannot 
find it. But since I explained why not, I'd love to hear why.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to