flyrain commented on issue #50:
URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3628550592
How we evaluate the results is critical to deciding whether any of these
“better context” approaches are actually effective. Having a solid evaluation
setup is a prerequisite for iterating on context strategies.
We probably need some simple benchmarks. My initial idea:
- Build a set of natural-language questions/tasks (e.g., "Create role X and
grant it to principal Y with privileges Z on table T", "Create a table with
location xxx, properties xxx, schema xxx").
- Let the LLM use MCP to translate those questions into concrete MCP tool
calls (JSON payloads).
- Evaluate whether the generated MCP input is correct along at least two
dimensions:
1. **Did it select the right command(s)?**
For example, did it generate the correct sequence of operations like
`create_role`, then `attach_role_to_principal`, vs. something unrelated?
2. **Is the payload correct and complete?**
This determines whether the command can actually succeed and whether it
accurately captures the user's intent (correct fields, types, references, etc.).
A rough scoring sketch along these two dimensions is below.
Once we have a small but representative benchmark suite, we can compare
different context strategies (MCP prompt vs. Skill vs. RAG) in a more objective
way instead of relying on gut feel.
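Comparing strategies would then mostly be a matter of aggregating the per-case scores. Again just a sketch: `generate_calls(strategy, question)` is a placeholder for "ask the LLM, under the given context strategy, to produce the MCP tool calls":

```python
def compare(strategies, cases, generate_calls):
    """Aggregate command/payload accuracy per context strategy."""
    results = {}
    for strategy in strategies:
        scores = [score_case(c, generate_calls(strategy, c.question)) for c in cases]
        results[strategy] = {
            "command_acc": sum(s["command_ok"] for s in scores) / len(cases),
            "payload_acc": sum(s["payload_ok"] for s in scores) / len(cases),
        }
    return results


# e.g. compare(["mcp_prompt", "skill", "rag"], benchmark_cases, generate_calls)
```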