kaxil opened a new pull request, #62785: URL: https://github.com/apache/airflow/pull/62785
Part of AIP-99: https://github.com/orgs/apache/projects/586

Toolsets that expose Airflow hooks as pydantic-ai agent tools.

- **HookToolset** — generic adapter that exposes any Airflow Hook's methods as pydantic-ai tools via signature introspection. Requires an explicit `allowed_methods` list (no auto-discovery). Builds JSON Schema from method signatures and enriches tool descriptions from docstrings (Sphinx and Google style).
- **SQLToolset** — curated 4-tool database toolset (`list_tables`, `get_schema`, `query`, `check_query`) wrapping `DbApiHook`. Read-only by default with SQL validation, `allowed_tables` metadata filtering, and `max_rows` truncation.

Both implement pydantic-ai's `AbstractToolset` interface.

## Design rationale

**Why custom introspection instead of pydantic-ai's `_function_schema`?** Hook methods are bound methods with `self`, decorators like `@provide_bucket_name`, and complex signatures. Our lightweight approach (`inspect.signature` + `get_type_hints`) avoids coupling to pydantic-ai internals.

**Why `sequential=True` on all tool definitions?** Hook methods perform synchronous I/O and share connection state, so concurrent execution would be unsafe.

**Why is `allowed_tables` metadata-only rather than query-level validation?** Parsing SQL for table references (CTEs, subqueries, aliases, vendor-specific syntax) is complex and error-prone. We chose not to provide a false sense of security; real access control belongs at the DB permission level.

**Why does HookToolset require explicit `allowed_methods`?** Auto-discovery would expose every public method on a hook (including `run()`, `get_connection()`, etc.), giving an LLM broad unintended access. Explicit listing forces DAG authors to think about the blast radius.
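The introspection approach described above (`inspect.signature` + `get_type_hints`, no dependency on pydantic-ai internals) can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `DemoHook` and `fetch_rows` are hypothetical, and the real code also handles defaults, docstring parsing, and many more types.

```python
# Sketch: build a JSON Schema for a hook method's parameters using only
# inspect.signature and typing.get_type_hints (stdlib, no pydantic-ai internals).
import inspect
from typing import get_type_hints

# Assumption: the real implementation maps many more Python types than this.
_TYPE_MAP = {int: "integer", str: "string", bool: "boolean", float: "number"}


def method_to_json_schema(method) -> dict:
    """Build a JSON Schema object describing a bound method's parameters."""
    sig = inspect.signature(method)  # bound methods already exclude `self`
    hints = get_type_hints(method)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        if name == "self":  # defensive: unbound functions still carry `self`
            continue
        properties[name] = {"type": _TYPE_MAP.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


class DemoHook:
    """Hypothetical hook used only for this sketch."""

    def fetch_rows(self, table: str, limit: int = 10) -> list:
        """Fetch up to ``limit`` rows from ``table``."""
        return []


schema = method_to_json_schema(DemoHook().fetch_rows)
# `limit` has a default, so only `table` ends up in "required".
```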
## Usage

```python
from airflow.providers.common.ai.toolsets.hook import HookToolset
from airflow.providers.common.ai.toolsets.sql import SQLToolset

# SQL toolset — 4 curated tools for database access
sql_tools = SQLToolset(
    db_conn_id="postgres_default",
    allowed_tables=["customers", "orders"],
    max_rows=20,
)

# Hook toolset — wrap any hook's methods as tools
from airflow.providers.http.hooks.http import HttpHook

http_tools = HookToolset(
    HttpHook(http_conn_id="my_api"),
    allowed_methods=["run"],
    tool_name_prefix="http_",
)
```

## Gotchas / Tradeoffs

- `allowed_tables` hides tables from `list_tables`/`get_schema` but does NOT parse SQL queries. An LLM can `SELECT * FROM secrets` if it guesses the name. Use DB permissions for real access control.
- `HookToolset` exposes whatever methods you list — the agent controls the arguments. Don't expose `run()` or `get_connection()`.
- `allow_writes=False` (the default) validates SQL through `validate_sql()` and rejects INSERT/UPDATE/DELETE/DROP.
- SQLToolset lazy-resolves the `DbApiHook` on first use via `BaseHook.get_connection(conn_id).get_hook()`. Non-DbApiHook connections raise `ValueError`.
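The read-only guard in the last two bullets can be illustrated with a minimal sketch. This is an assumption about the shape of the check, not the PR's actual `validate_sql()` (which may be stricter, e.g. about comments or multi-statement input); the function name `validate_sql_sketch` is invented for this example.

```python
# Sketch: reject write statements when allow_writes=False by checking the
# leading SQL keyword. Illustrative only; the real validate_sql() may differ.
_WRITE_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "truncate", "create"}


def validate_sql_sketch(sql: str, allow_writes: bool = False) -> None:
    """Raise ValueError for write statements submitted in read-only mode."""
    if allow_writes:
        return
    words = sql.split()
    first = words[0].lower() if words else ""
    if first in _WRITE_KEYWORDS:
        raise ValueError(f"Write statement {first.upper()!r} rejected in read-only mode")


validate_sql_sketch("SELECT * FROM customers")          # passes
validate_sql_sketch("DELETE FROM orders", allow_writes=True)  # passes
# validate_sql_sketch("DROP TABLE orders") would raise ValueError
```

Note that a keyword check like this is exactly why the PR treats `allowed_tables` as metadata-only: lexical SQL inspection is easy to bypass, so real access control belongs in DB permissions.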
