Hi David,

Thanks for bringing this up. I have several thoughts on this and would love
to discuss them in more detail.

Right now there is a fair bit of cognitive load with the 3.2.0 release, but
I will happily discuss this right after that.

Best regards,
Vikram

On Thu, Apr 2, 2026 at 6:50 AM Blain David <[email protected]> wrote:

> Hi everyone,
>
> I would like to start a discussion about whether there is interest in
> adding a SAS provider to Airflow, based on some challenges and solutions
> we’ve encountered in our environment. Let me explain.
>
> Until recently, we relied on the JdbcHook combined with GenericTransfer to
> extract data from SAS to our Postgres DWH. However, this approach
> introduces significant overhead due to the JVM dependency (via jaydebeapi).
> In practice, this leads to duplicated memory usage and instability during
> larger transfers (as our workers on K8S pods are limited in memory).
> Lately, we’ve experienced multiple failures due to new imports of very
> large tables, including JVM crashes caused by OOM issues as well as
> intermittent backend errors from SAS during long runs (multiple hours,
> especially for the initial runs).
>
> We also evaluated saspy as an alternative, but since it also relies on the
> same underlying JDBC/JVM mechanism (e.g. jaydebeapi), it does not resolve
> these issues.
>
> To address this, we implemented a native SASHook that uses SSH underneath
> to connect to the SAS environment. This allows us to execute SQL queries
> directly on the SAS server and retrieve results without involving a JVM. On
> top of that, we've built a DB-API 2.0–compatible cursor implementation that
> handles the interaction over SSH and integrates with pandas (leveraging
> read_sas for returning results as DataFrames), thus it integrates
> transparently with the DBApiHook.
>
> This approach has proven to be much more stable in our DAGs. Since
> adopting it, we have not encountered any further failures. While
> performance is not on par with native ODBC or database drivers, it is
> still an improvement over the JDBC bridge. For reference, we can fetch
> around 10k records in ~2 seconds. The initial query is slightly slower due
> to the SSH handshake and session setup, but overall the trade-off is
> acceptable given the stability gains. Performance can probably still be
> improved, and we will look into it further, but at least the instability
> issue is resolved and our pipelines are reliable again.
>
> I would like to get feedback on how to contribute this, if there is
> interest of course.
>
> Should I split this into two parts? A standalone open-source project
> providing a generic SAS DB-API 2.0 implementation and an Airflow provider
> that depends on this library?
> Or would it be preferable to include the full implementation directly
> within the SAS Airflow provider (e.g., as a hook and cursor)?
>
> I’m happy to contribute the implementation and help maintain it, but I’d
> like to align with the community on the preferred approach before moving
> forward.
>
> The advantage of this solution is that it doesn’t require any new
> dependencies; it relies purely on libraries (paramiko and pandas) already
> present in the Airflow ecosystem.
>
> Any thoughts or suggestions would be greatly appreciated.
>
> Best regards,
> David
>

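For readers unfamiliar with the DB-API 2.0 shape David describes, the sketch
below shows roughly what such a cursor looks like. This is an illustration,
not David's actual code: the class name, the injected `transport` callable,
and the CSV parsing are assumptions made here so the example is
self-contained. A real transport would run the SQL on the SAS server over
SSH (e.g. via paramiko) and could parse the result with `pandas.read_sas`,
as the original message describes.

```python
import csv
import io


class SASCursor:
    """Minimal DB-API 2.0-style cursor sketch (illustrative, not the
    implementation from the thread): execute() sends SQL through an
    injected transport, fetch*() returns the parsed rows."""

    def __init__(self, transport):
        # transport: callable taking a SQL string and returning raw CSV
        # text. In the real setup this would be an SSH round-trip to SAS.
        self._transport = transport
        self._rows = []
        self.description = None
        self.rowcount = -1

    def execute(self, sql, parameters=None):
        raw = self._transport(sql)
        reader = csv.reader(io.StringIO(raw))
        header = next(reader)
        # DB-API description entries are 7-tuples; only the column name
        # is known in this sketch.
        self.description = [
            (name, None, None, None, None, None, None) for name in header
        ]
        self._rows = [tuple(row) for row in reader]
        self.rowcount = len(self._rows)
        return self

    def fetchone(self):
        return self._rows.pop(0) if self._rows else None

    def fetchall(self):
        rows, self._rows = self._rows, []
        return rows

    def close(self):
        self._rows = []
```

Because the object exposes `execute`, `fetchone`/`fetchall`, `description`,
and `rowcount`, generic DB-API consumers (such as Airflow's `DbApiHook`
methods) can use it without knowing anything about the SSH layer underneath,
which is presumably what makes the integration "transparent" as described.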