SkinnyPigeon opened a new pull request, #36187: URL: https://github.com/apache/superset/pull/36187
<!--- Please write the PR title following the conventions at https://www.conventionalcommits.org/en/v1.0.0/ Example: fix(dashboard): load charts correctly --> ### SUMMARY AI-powered SQL generation POC for Apache Superset. Adds an AI Assistant that generates and executes SQL queries from natural language using HuggingFace models. Only the input text and the dataset's column descriptions are used to generate the query with the AI not able to view the underlying data. I wanted to gauge interest in a feature like this in the OSS version. I know we have team members hoping to use something like this to interact with data. --- This adds a new AI Assistant page to the Superset frontend alongside a new API endpoint. This uses the `huggingface_hub` package alongside a free token to generate the query text based on user questions about individual datasets. This works well when teams have added column descriptions, as they are passed to the model to inform it of the best way to generate the query. The videos below show it running as well as giving a quick preview of the context shared with the model. I am not a frontend person so please forgive the sytling. If this was a feature others were interested in, I would require help getting this up to the project's standards, including help with the frontend testing suite. I have tested this with PostgreSQL and Redshift-based datasets. More testing with other dialects would be welcome. As the default model, I've gone for `Qwen/Qwen2.5-Coder-32B-Instruct` which seems to do the job quite well. This and the token are able to be set via env vars and have been added to the config. ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF Video showing the user using natural language in a chat box, which returns a SQL query that is then executed against the data source with the results shown inside the chat box https://github.com/user-attachments/assets/aa098f92-4637-41bc-89b8-5ff2578fa7c1 Data showing the query context that allows the SQL to be generated. This is autopopulated via the `desscriptions` of the columns of the Dataset, along with additional info such as the database's dialect to try to generate the correct syntax https://github.com/user-attachments/assets/728b53df-6f24-43bf-9c80-0970cb308ca7 The default page look: <img width="1506" height="842" alt="Screenshot 2025-11-19 at 14 20 42" src="https://github.com/user-attachments/assets/adb30453-4b7f-4ef3-ac39-244a67a57bcf" /> The page with a Dataset selected from the drop down, giving the user a small preview of the data they will be querying: <img width="1510" height="849" alt="Screenshot 2025-11-19 at 14 17 36" src="https://github.com/user-attachments/assets/df35e3e9-a2cf-4016-8622-dd71334f3338" /> ### TESTING INSTRUCTIONS 1. Set HuggingFace API token via [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). This can be a free account with no credit card required. 2. Make the env var available to your deployment process, e.g. `export HF_API_TOKEN=your_token` 3. Optional: Set model: `export HF_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct` 4. Build and run Superset 5. Pick a dataset and add descriptions for any columns you would like to query, an example of values for `ab_user` can be found below 6. Navigate to `/aiassistant/` 7. Select a dataset 8. Ask: "Who is the most active user currently?" or "Which was the month with the most new users?"... 9. Click "Execute Query" to see results ### ADDITIONAL INFORMATION <!--- Check any relevant boxes with "x" --> <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue --> - [ ] Has associated issue: - [ ] Required feature flags: - [x] Changes UI - [ ] Includes DB Migration (follow approval process in [SIP-59](https://github.com/apache/superset/issues/13351)) - [ ] Migration is atomic, supports rollback & is backwards-compatible - [ ] Confirm DB migration upgrade and downgrade tested - [ ] Runtime estimates and downtime expectations provided - [x] Introduces new feature or API - [ ] Removes existing feature or API -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
