gaspardc-met opened a new issue, #1881:
URL: https://github.com/apache/arrow-adbc/issues/1881
### What happened?
Context before the bug (working):
- Postgres database on Kubernetes with several tables
- 4 services (webapp, machine learning inference, and FastAPI backend APIs)
deployed on kubernetes and fetching data from postgres
- 1 service, data orchestrator, writing data to Postgres
- Fetching data from PG with `pd.read_sql` from Pandas and a SQLalchemy
engine
- Been doing this for 1+ year without any Postgres issues
Switching to ADBC:
- Following my upgrade to pandas >2.0.0 I wanted to switch to
`adbc_driver_postgresql`'s `dbapi` connection with `pd.read_sql`
- Initial tests were great, it was faster than before
- Deployed this to production on all aforementioned services twice
(initially with connection caching, then no caching and properly closing each
and every connection
- Once again initially smooth, everything worked and was fast
Problem:
- In both instances of the deployment, within ~12 hours, the connections
would be stuck
- Webapp or another service would create an ADBC connection, and run the sql
query with `pd.read_sql` (know this through caching) and then wait
indefinitely.
- Reloading the webapp, clearing webapp cache, recreating the connection
would do nothing at all
- The log on the Postgres pod indicated a password issue with the current
database/user, which never happened before
- Both SQLalchemy and ADBC get the same postgres URI to create the
engine/connection with
- Reverting to SQLalchemy solved the problem, and the error has not been
seen again
### How can we reproduce the bug?
- The given URI was `"postgresql://{user}:{password}@{host}:{port}/{db}"`
formatted with the proper values
- The function was used to create the ADBC connection:
```python
def create_adbc_conn() -> Connection:
logger_stdout.info(f"Creating a new ADBC connection at
{pd.Timestamp.now()}.")
uri = get_default_uri() # URI shown above, formatted
connection = dbapi.connect(uri=uri)
logger_stdout.info("ADBC connection created")
return connection
```
- The function to execute the SQL query was:
```python
def handle_sql_query(
sql: str,
index_col: Optional[str] = None,
connection: Optional[Connection] = None,
need_to_close: bool = False,
) -> pd.DataFrame:
if engine is None:
logger_stdout.info(f"Engine is None, creating a new ADBC connection
at {pd.Timestamp.now()}.")
connection= create_adbc_conn()
need_to_close = True
try:
logger_stdout.info("Executing SQL query with connection")
return pd.read_sql_query(sql=sql, con=connection,
index_col=index_col, parse_dates=[index_col])
finally:
if need_to_close:
logger_stdout.info("Closing the ADBC connection.")
connection.close()
```
- The SQL queries ranged from `select * from TABLE_NAME` to selecting
specific columns on a range of specific dates
### Environment/Setup
python 3.11
pandas == 2.2.2
adbc_driver_postgresql==0.11.0
adbc-driver-manager==0.11.0
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]