flyrain commented on issue #50:
URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3628550592

   How we evaluate the results is critical to deciding whether any of these 
“better context” approaches are actually effective. Having a solid evaluation 
setup is a prerequisite for iterating on context strategies.
   
   We probably need some simple benchmarks. My initial idea (a rough scoring sketch follows the list):
   - Build a set of natural-language questions/tasks (e.g., “Create role X and 
grant it to principal Y with privileges Z on table T”, "Create a table with 
location xxx, properties xxx, schema xxx").
   - Let the LLM use MCP to translate those questions into concrete MCP inputs 
(JSON payloads).
   - Evaluate whether the generated MCP input is correct along at least two 
dimensions:
     1. **Did it select the right command(s)?**  
        For example, did it generate the correct sequence of operations like 
`create_role`, then `attach_role_to_principal`, vs. something unrelated?
     2. **Is the payload correct and complete?**  
        This determines whether the command can actually succeed and whether it 
accurately captures the user’s intent (correct fields, types, references, etc.).
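
   To make the two dimensions concrete, here is a minimal scoring sketch in Python. It assumes the LLM's output can be captured as an ordered list of `(tool_name, payload)` pairs; the tool names and payload fields below are placeholders following the examples above, not the actual Polaris MCP schema.

   ```python
   from dataclasses import dataclass


   @dataclass
   class ExpectedCall:
       tool: str      # e.g. "create_role" (placeholder, not the actual Polaris MCP schema)
       payload: dict  # fields the generated payload must contain


   @dataclass
   class BenchmarkCase:
       question: str                 # natural-language task given to the LLM
       expected: list[ExpectedCall]  # reference sequence of MCP calls


   def score_case(case: BenchmarkCase, actual: list[tuple[str, dict]]) -> dict:
       """Score one benchmark case along the two dimensions above."""
       # Dimension 1: did the model pick the right command(s), in the right order?
       commands_ok = [c.tool for c in case.expected] == [tool for tool, _ in actual]

       # Dimension 2: is each payload correct and complete? "Complete" here means
       # every expected field is present with the expected value; extra fields are
       # tolerated. Only checked once the command sequence already matches.
       payloads_ok = commands_ok and all(
           all(payload.get(k) == v for k, v in exp.payload.items())
           for exp, (_, payload) in zip(case.expected, actual)
       )
       return {"commands_ok": commands_ok, "payloads_ok": payloads_ok}


   # Example case for "Create role X and grant it to principal Y"
   case = BenchmarkCase(
       question="Create role X and grant it to principal Y",
       expected=[
           ExpectedCall("create_role", {"name": "X"}),
           ExpectedCall("attach_role_to_principal", {"role": "X", "principal": "Y"}),
       ],
   )
   ```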
   
   Once we have a small but representative benchmark suite, we can compare 
different context strategies (MCP prompt vs. Skill vs. RAG) in a more objective 
way instead of relying on gut feel.
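
   Comparing strategies is then just running the same cases under each setup and aggregating the per-case scores. A rough sketch reusing `score_case` from above; `run_with_strategy` is a hypothetical hook that asks the LLM to produce MCP calls under a given context strategy, and the strategy labels are placeholders.

   ```python
   def compare_strategies(cases, strategies, run_with_strategy):
       """Run the same benchmark cases under each context strategy and tally accuracy.

       `run_with_strategy(question, strategy)` is a hypothetical hook that returns
       the list of (tool_name, payload) calls the LLM produced for `question`
       under the given strategy.
       """
       results = {}
       for strategy in strategies:
           scores = [score_case(c, run_with_strategy(c.question, strategy)) for c in cases]
           results[strategy] = {
               "command_accuracy": sum(s["commands_ok"] for s in scores) / len(scores),
               "payload_accuracy": sum(s["payloads_ok"] for s in scores) / len(scores),
           }
       return results


   # e.g. compare_strategies(cases, ["mcp_prompt", "skill", "rag"], run_with_strategy)
   ```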

