SkinnyPigeon opened a new pull request, #36187:
URL: https://github.com/apache/superset/pull/36187

   <!---
   Please write the PR title following the conventions at 
https://www.conventionalcommits.org/en/v1.0.0/
   Example:
   fix(dashboard): load charts correctly
   -->
   
   ### SUMMARY
   AI-powered SQL generation POC for Apache Superset. Adds an AI Assistant that 
generates and executes SQL queries from natural language using HuggingFace 
models. Only the input text and the dataset's column descriptions are used to 
generate the query with the AI not able to view the underlying data.
   
   I wanted to gauge interest in a feature like this in the OSS version. I know 
we have team members hoping to use something like this to interact with data. 
   
   ---
   This adds a new AI Assistant page to the Superset frontend alongside a new 
API endpoint. This uses the `huggingface_hub` package alongside a free token to 
generate the query text based on user questions about individual datasets. This 
works well when teams have added column descriptions, as they are passed to the 
model to inform it of the best way to generate the query. The videos below show 
it running as well as giving a quick preview of the context shared with the 
model.
   
   I am not a frontend person so please forgive the sytling. If this was a 
feature others were interested in, I would require help getting this up to the 
project's standards, including help with the frontend testing suite.
   
   I have tested this with PostgreSQL and Redshift-based datasets. More testing 
with other dialects would be welcome.
   
   As the default model, I've gone for `Qwen/Qwen2.5-Coder-32B-Instruct` which 
seems to do the job quite well. This and the token are able to be set via env 
vars and have been added to the config.
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   Video showing the user using natural language in a chat box, which returns a 
SQL query that is then executed against the data source with the results shown 
inside the chat box
   
https://github.com/user-attachments/assets/aa098f92-4637-41bc-89b8-5ff2578fa7c1
   
   Data showing the query context that allows the SQL to be generated. This is 
autopopulated via the `desscriptions` of the columns of the Dataset, along with 
additional info such as the database's dialect to try to generate the correct 
syntax
   
https://github.com/user-attachments/assets/728b53df-6f24-43bf-9c80-0970cb308ca7
   
   The default page look:
   <img width="1506" height="842" alt="Screenshot 2025-11-19 at 14 20 42" 
src="https://github.com/user-attachments/assets/adb30453-4b7f-4ef3-ac39-244a67a57bcf";
 />
   
   The page with a Dataset selected from the drop down, giving the user a small 
preview of the data they will be querying: 
   <img width="1510" height="849" alt="Screenshot 2025-11-19 at 14 17 36" 
src="https://github.com/user-attachments/assets/df35e3e9-a2cf-4016-8622-dd71334f3338";
 />
   
   
   ### TESTING INSTRUCTIONS
   1. Set HuggingFace API token via 
[huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). This 
can be a free account with no credit card required. 
   2. Make the env var available to your deployment process, e.g. `export 
HF_API_TOKEN=your_token`
   3. Optional: Set model: `export HF_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct`
   4. Build and run Superset
   5. Pick a dataset and add descriptions for any columns you would like to 
query, an example of values for `ab_user` can be found below
   6. Navigate to `/aiassistant/`
   7. Select a dataset
   8. Ask: "Who is the most active user currently?" or "Which was the month 
with the most new users?"...
   9. Click "Execute Query" to see results
   
   ### ADDITIONAL INFORMATION
   <!--- Check any relevant boxes with "x" -->
   <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
   - [ ] Has associated issue:
   - [ ] Required feature flags:
   - [x] Changes UI
   - [ ] Includes DB Migration (follow approval process in 
[SIP-59](https://github.com/apache/superset/issues/13351))
     - [ ] Migration is atomic, supports rollback & is backwards-compatible
     - [ ] Confirm DB migration upgrade and downgrade tested
     - [ ] Runtime estimates and downtime expectations provided
   - [x] Introduces new feature or API
   - [ ] Removes existing feature or API
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to