Hi everyone,

I would like to start a discussion about whether there is interest in adding a 
SAS provider to Airflow, based on some challenges and solutions we’ve 
encountered in our environment. Let me explain.

Until recently, we relied on the JdbcHook combined with GenericTransfer to 
extract data from SAS into our Postgres DWH. However, this approach introduces 
significant overhead due to the JVM dependency (via jaydebeapi). In practice, 
this leads to duplicated memory usage and instability during larger transfers, 
as our workers run in memory-limited K8S pods. Lately, we’ve experienced 
multiple failures when importing very large tables, including JVM crashes 
caused by OOM issues as well as intermittent backend errors from SAS during 
long runs (multiple hours, especially for the initial loads).
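For context, the previous setup looked roughly like this (a sketch only: the connection IDs, table names, and the preoperator are hypothetical, not our actual DAG):

```python
from airflow.operators.generic_transfer import GenericTransfer

# Sketch of the JDBC-based transfer we used previously.
# "sas_jdbc" would be a JDBC connection served by JdbcHook
# (jaydebeapi + a JVM); "postgres_dwh" a Postgres connection.
transfer = GenericTransfer(
    task_id="sas_to_dwh",
    source_conn_id="sas_jdbc",           # hypothetical connection ID
    destination_conn_id="postgres_dwh",  # hypothetical connection ID
    sql="SELECT * FROM saslib.sales",    # hypothetical SAS library/table
    destination_table="staging.sales",
    preoperator="TRUNCATE TABLE staging.sales",
)
```

GenericTransfer fetches all rows on the source hook and inserts them via the destination hook, which is exactly where the duplicated memory usage (JVM heap plus Python process) hurts on memory-limited pods.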

We also evaluated saspy as an alternative, but since it also relies on the same 
underlying JDBC/JVM mechanism (e.g. jaydebeapi), it does not resolve these 
issues.

To address this, we implemented a native SASHook that uses SSH underneath to 
connect to the SAS environment. This allows us to execute SQL queries directly 
on the SAS server and retrieve results without involving a JVM. On top of that, 
we've built a DB-API 2.0–compatible cursor implementation that handles the 
interaction over SSH and integrates with pandas (leveraging read_sas to 
return results as DataFrames), so it plugs transparently into the 
DbApiHook.
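To give a feel for the shape of this, here is a minimal sketch of such a DB-API 2.0–style cursor. All names are hypothetical (this is not the actual implementation), and the SSH transport is abstracted behind a run_sql callable so the structure is visible without paramiko:

```python
from typing import Callable, List, Optional, Sequence, Tuple

class SASCursor:
    """Minimal DB-API 2.0-style cursor sketch. SQL execution is delegated
    to a callable; in practice that callable would run PROC SQL on the SAS
    server over a paramiko SSH channel and return the result rows."""

    def __init__(self, run_sql: Callable[[str], List[Tuple]]):
        self._run_sql = run_sql          # hypothetical SSH-backed transport
        self._rows: List[Tuple] = []
        self._index = 0
        self.description = None          # would be populated after execute()
        self.rowcount = -1

    def execute(self, sql: str, parameters: Sequence = ()):
        # naive parameter substitution, for illustration only
        if parameters:
            sql = sql % tuple(parameters)
        self._rows = self._run_sql(sql)
        self._index = 0
        self.rowcount = len(self._rows)
        return self

    def fetchone(self) -> Optional[Tuple]:
        if self._index >= len(self._rows):
            return None
        row = self._rows[self._index]
        self._index += 1
        return row

    def fetchmany(self, size: int = 1) -> List[Tuple]:
        rows = self._rows[self._index:self._index + size]
        self._index += len(rows)
        return rows

    def fetchall(self) -> List[Tuple]:
        rows = self._rows[self._index:]
        self._index = len(self._rows)
        return rows

    def close(self):
        self._rows = []

# usage with a fake transport standing in for the SSH call
fake_transport = lambda sql: [(1, "a"), (2, "b")]
cur = SASCursor(fake_transport)
cur.execute("SELECT * FROM work.demo")
print(cur.fetchall())   # → [(1, 'a'), (2, 'b')]
```

Because DbApiHook only needs a connection object exposing cursors with this interface, implementing the cursor contract is what makes the SSH-based hook a drop-in replacement for the JDBC one.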

This approach has proven much more stable in our DAGs. Since adopting it, 
we have not encountered any further failures. While performance is not on par 
with native ODBC or database drivers, it is still an improvement over the JDBC 
bridge. For reference, we can fetch around 10k records in ~2 seconds. The 
initial query is slightly slower due to SSH handshake and session setup, but 
overall the trade-off is acceptable given the stability gains. Performance can 
probably still be improved, and we will look into it further, but at least we 
have resolved the instability and our pipelines are reliable again.

I would like to get feedback on how best to contribute this, assuming there is 
interest.

Should I split this into two parts? A standalone open-source project providing 
a generic SAS DB-API 2.0 implementation and an Airflow provider that depends on 
this library?
Or would it be preferable to include the full implementation directly within 
the SAS Airflow provider (e.g., as a hook and cursor)?

I’m happy to contribute the implementation and help maintain it, but I’d like 
to align with the community on the preferred approach before moving forward.

Another advantage of this solution is that it doesn't require any new 
dependencies; it relies purely on packages (paramiko and pandas) already 
present in the Airflow ecosystem.

Any thoughts or suggestions would be greatly appreciated.

Best regards,
David
