MonkeyCanCode commented on issue #50:
URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3694343838

   > How we evaluate the results is critical to deciding whether any of these 
“better context” approaches are actually effective. Having a solid evaluation 
setup is a prerequisite for iterating on context strategies.
   > 
   > We probably need some simple benchmarks. My initial idea:
   > 
   > * Build a set of natural-language questions/tasks (e.g., “Create role X 
and grant it to principal Y with privileges Z on table T”, "Create a table with 
location xxx, properties xxx, schema xxx").
   > * Let the LLM use MCP to translate those questions into concrete MCP 
inputs (JSON payloads).
   > * Evaluate whether the generated MCP input is correct along at least two 
dimensions:
   >   
   >   1. **Did it select the right command(s)?**
   >      For example, did it generate the correct sequence of operations like 
`create_role`, then `attach_role_to_principal`, vs. something unrelated?
   >   2. **Is the payload correct and complete?**
   >      This determines whether the command can actually succeed and whether 
it accurately captures the user’s intent (correct fields, types, references, 
etc.).
   > 
   > Once we have a small but representative benchmark suite, we can compare 
different context strategies (MCP prompt vs. Skill vs. RAG) in a more objective 
way instead of relying on gut feel.
   
   So I tried a couple of things this week for the benchmark (with Ollama and OpenAI). The challenge is that the accuracy of the MCP server differs a lot depending on which LLM is used (it may be hard to run a very large model locally), and running with a tool such as Claude can burn through the free tokens easily. Maybe there are better options than what I tried?
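
   For what it's worth, here is a rough sketch of how I was scoring a single case along those two dimensions (right tool sequence, correct payload). The tool names come from the example in the quote; the payload field names (`role_name`, `principal`) and the hard-coded model output are just placeholders rather than the real Polaris MCP schema, and the actual LLM/MCP call is left out:

   ```python
   from dataclasses import dataclass


   @dataclass
   class BenchmarkCase:
       prompt: str                    # natural-language task given to the LLM
       expected_tools: list[str]      # ordered MCP tool names we expect it to call
       expected_payloads: list[dict]  # required key/value pairs for each call


   @dataclass
   class ToolCall:
       tool: str
       payload: dict


   def score_case(case: BenchmarkCase, calls: list[ToolCall]) -> dict:
       # Dimension 1: did it pick the right command sequence?
       tools_ok = [c.tool for c in calls] == case.expected_tools

       # Dimension 2: does each payload carry the required fields/values?
       # (payloads are compared positionally, so this is only meaningful
       # when the tool sequence is roughly right)
       hits = 0
       for expected, call in zip(case.expected_payloads, calls):
           if all(call.payload.get(k) == v for k, v in expected.items()):
               hits += 1
       payload_score = hits / len(case.expected_payloads)

       return {"tools_ok": tools_ok, "payload_score": payload_score}


   CASES = [
       BenchmarkCase(
           prompt=("Create role data_eng and grant it to principal alice "
                   "with SELECT on table sales.orders"),
           expected_tools=["create_role", "attach_role_to_principal"],
           expected_payloads=[
               {"role_name": "data_eng"},
               {"role_name": "data_eng", "principal": "alice"},
           ],
       ),
   ]


   if __name__ == "__main__":
       # Fake model output standing in for whatever the LLM + MCP client
       # returns; in the real harness this would come from Ollama/OpenAI/Claude.
       fake_calls = [
           ToolCall("create_role", {"role_name": "data_eng"}),
           ToolCall("attach_role_to_principal",
                    {"role_name": "data_eng", "principal": "alice"}),
       ]
       print(score_case(CASES[0], fake_calls))
   ```

   Running the same `CASES` list against each model (local Ollama models vs. hosted OpenAI/Claude) would at least give a per-model `tools_ok`/`payload_score` summary to compare instead of gut feel.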

