martimors opened a new issue, #33388:
URL: https://github.com/apache/superset/issues/33388

   ### Bug description
   
   The `[databricks]` extra of the `apache-superset` package installs the deprecated and archived `sqlalchemy-databricks` package, which no longer works with Databricks since the introduction of Unity Catalog and the deprecation (and upcoming removal) of the Hive metastore. The specific failure is that it cannot ignore the hidden partitioning columns of Delta tables when running `DESCRIBE ...` queries. The new driver, aptly named `databricks-sqlalchemy` for maximum confusion, is a drop-in replacement for the old one, is supported by Databricks, and should work out of the box. However, I have found one issue that was not present with the old driver.
   
   The datasource is created successfully, but the webserver never returns and 
the request times out.
   
   Steps to reproduce:
   1. From a stock `apache/superset` Docker image, install the new Databricks connector with SQLAlchemy 1.x support (not the SQLAlchemy 2.x flavour, which does not work with Superset):
       ```sh
       pip install "databricks-sqlalchemy~=1.0"
       ```
   2. In Superset, create a datasource using a SQLAlchemy URI of this form (a standalone SQLAlchemy sketch of the same connection follows after this list):
       ```
       databricks://token:xxxxxxx...@xxxxxxxxxxx.azuredatabricks.net:443
       ```
       Include the extra engine parameters to identify the SQL warehouse to use:
       ```json
       {"connect_args":{"http_path":"/sql/1.0/warehouses/XXXXXXXXXXXX"}}
       ```
       Lastly, tick the box to allow selecting a catalog.
   3. Create the datasource and observe the browser hanging forever, with a spinner in the UI that never disappears.
   4. Refresh the page and observe that the datasource was created successfully after all.
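
   For reference, the same URI and engine parameters can be exercised directly with SQLAlchemy outside Superset. Below is a minimal sketch (host, token, and warehouse path are placeholders, exactly as above); in my case the connection itself succeeds:
   ```python
   # Minimal sketch: drive the new SQLAlchemy 1.x Databricks dialect directly,
   # using the same URI and connect_args that were given to Superset above.
   from sqlalchemy import create_engine, text

   engine = create_engine(
       "databricks://token:xxxxxxx...@xxxxxxxxxxx.azuredatabricks.net:443",
       connect_args={"http_path": "/sql/1.0/warehouses/XXXXXXXXXXXX"},
   )

   with engine.connect() as conn:
       # A trivial round-trip to confirm the dialect and warehouse respond.
       print(conn.execute(text("SELECT 1")).scalar())
   ```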
   
   I have noticed that while this is going on, the Databricks SQL warehouse is receiving `SHOW SCHEMAS` queries constantly, in a loop:
   
   
![Image](https://github.com/user-attachments/assets/5314e4f9-9c3c-4248-8378-12576be283bf)
   
   The server logs show the same loop until the server eventually returns a 201, if the client has not already timed out.
   
   [logs.txt](https://github.com/user-attachments/files/20099380/logs.txt)
   
   I suspect that Superset is pre-fetching all schemas in all catalogs, perhaps to test the connection or to populate some cache?
   
   This is not ideal in our environment: we have hundreds of catalogs and would like to load only what we need, when we need it.
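
   For illustration, here is what pre-fetching every schema in every catalog looks like when driven by hand: one `SHOW SCHEMAS` round-trip per catalog, which matches the loop in the logs. This is only a sketch of the suspected pattern; I have not traced the exact code path in Superset:
   ```python
   # Sketch of the suspected prefetch pattern: list catalogs, then issue one
   # SHOW SCHEMAS per catalog. With hundreds of catalogs this loop alone can
   # outlast the HTTP request timeout.
   from sqlalchemy import create_engine, text

   engine = create_engine(
       "databricks://token:xxxxxxx...@xxxxxxxxxxx.azuredatabricks.net:443",
       connect_args={"http_path": "/sql/1.0/warehouses/XXXXXXXXXXXX"},
   )

   with engine.connect() as conn:
       catalogs = [row[0] for row in conn.execute(text("SHOW CATALOGS"))]
       for catalog in catalogs:
           # This is the query that repeats in the attached logs.
           schemas = conn.execute(text(f"SHOW SCHEMAS IN `{catalog}`")).fetchall()
           print(catalog, len(schemas))
   ```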
   
   ### Screenshots/recordings
   
   _No response_
   
   ### Superset version
   
   4.1.2
   
   ### Python version
   
   3.10
   
   ### Node version
   
   Not applicable
   
   ### Browser
   
   Not applicable
   
   ### Additional context
   
   _No response_
   
   ### Checklist
   
   - [x] I have searched Superset docs and Slack and didn't find a solution to 
my problem.
   - [x] I have searched the GitHub issue tracker and didn't find a similar bug 
report.
   - [x] I have checked Superset's logs for errors and if I found a relevant 
Python stacktrace, I included it here as text in the "additional context" 
section.

