MonkeyCanCode commented on issue #50: URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3694343838
> How we evaluate the results is critical to deciding whether any of these “better context” approaches are actually effective. Having a solid evaluation setup is a prerequisite for iterating on context strategies.
>
> We probably need some simple benchmarks. My initial idea:
>
> * Build a set of natural-language questions/tasks (e.g., “Create role X and grant it to principal Y with privileges Z on table T”, "Create a table with location xxx, properties xxx, schema xxx").
> * Let the LLM use MCP to translate those questions into concrete MCP inputs (JSON payloads).
> * Evaluate whether the generated MCP input is correct along at least two dimensions:
>
>   1. **Did it select the right command(s)?**
>      For example, did it generate the correct sequence of operations like `create_role`, then `attach_role_to_principal`, vs. something unrelated?
>   2. **Is the payload correct and complete?**
>      This determines whether the command can actually succeed and whether it accurately captures the user’s intent (correct fields, types, references, etc.).
>
> Once we have a small but representative benchmark suite, we can compare different context strategies (MCP prompt vs. Skill vs. RAG) in a more objective way instead of relying on gut feel.

I tried a couple of things this week for the benchmark (with Ollama and OpenAI). The challenge is that the accuracy of the MCP server differs a lot depending on which LLM model is used (and it may be hard to run a very large model locally), while running with a tool such as Claude can burn through free tokens quickly. Maybe there are better options than what I tried?
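For discussion, something like the following minimal harness is roughly what I have been experimenting with. It is only a sketch: the `BenchmarkCase` shape, the tool names, and the `generate_calls` hook are placeholders (not the actual Polaris MCP tool schema), and the payload check is deliberately naive (expected fields must be present with matching values).

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BenchmarkCase:
    """One benchmark case: a natural-language task plus the MCP tool calls
    (name + JSON arguments) we expect the model to produce for it."""
    prompt: str
    expected_calls: list[dict] = field(default_factory=list)


def score_case(case: BenchmarkCase, generated_calls: list[dict]) -> dict:
    """Score a case along the two dimensions above:
    1) did the model pick the right command sequence, and
    2) are the payloads complete/correct (expected fields present with
       matching values -- a simple subset check, nothing fancier)."""
    expected_names = [c["name"] for c in case.expected_calls]
    generated_names = [c.get("name") for c in generated_calls]
    right_commands = expected_names == generated_names

    payload_ok = right_commands and all(
        all(gen.get("arguments", {}).get(k) == v
            for k, v in exp.get("arguments", {}).items())
        for exp, gen in zip(case.expected_calls, generated_calls)
    )
    return {"right_commands": right_commands, "payload_ok": payload_ok}


def run_benchmark(cases: list[BenchmarkCase],
                  generate_calls: Callable[[str], list[dict]]) -> None:
    """generate_calls is whatever backend is plugged in (Ollama, OpenAI, ...);
    it takes the prompt and returns the tool calls the model produced as a
    list of {"name": ..., "arguments": {...}} dicts."""
    for case in cases:
        results = score_case(case, generate_calls(case.prompt))
        print(json.dumps({"prompt": case.prompt, **results}))


# Example case (tool names and arguments are illustrative only).
cases = [
    BenchmarkCase(
        prompt="Create role X and grant it to principal Y",
        expected_calls=[
            {"name": "create_role",
             "arguments": {"role": "X"}},
            {"name": "attach_role_to_principal",
             "arguments": {"role": "X", "principal": "Y"}},
        ],
    ),
]
```

The idea is that swapping context strategies (MCP prompt vs. Skill vs. RAG) or model backends only changes the `generate_calls` implementation, so the same case set and scoring can be reused for comparisons.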
