timsaucer commented on PR #1497:
URL:
https://github.com/apache/datafusion-python/pull/1497#issuecomment-4270369463
With my latest push I have a folder that contains only the text descriptions
of the TPC-H queries, and I gave the agent this guidance:
Review the @README.md and @AGENTS.md in this directory. Each of the problem
statements is listed in @problems/ . I want you to generate solutions for each
problem statement. However when you do this you are forbidden from making any
changes to your solution after your first evaluation. This is an attempt to
test that our agents file contains all of the necessary instructions, so you
should be able to get each one right on the first attempt.
The contents of README.md were:
# DataFusion Python - TPC-H Queries
## Overview
This project implements TPC-H benchmark queries using idiomatic
datafusion-python code. The goal is to translate natural language problem
descriptions into DataFrame API queries, **not** to transliterate SQL into
Python.
## Data
TPC-H parquet files are located in the `data/` directory:
- `customer.parquet`
- `lineitem.parquet`
- `nation.parquet`
- `orders.parquet`
- `part.parquet`
- `partsupp.parquet`
- `region.parquet`
- `supplier.parquet`
## Approach
Each query should be written as idiomatic datafusion-python, using the DataFrame
API with fluent chaining, `col()`/`lit()` expressions, and functions from
the `functions` module. Solutions should keep data in Arrow-native formats and
avoid unnecessary conversions to Python types.
## Allowed Sources
- `AGENTS.md` — local copy of the datafusion-python DataFrame API guide
- datafusion-python documentation at https://datafusion.apache.org/python/
- Problem descriptions in the `problems/` directory
## Restrictions
- **Do not use or analyze any TPC-H SQL queries.** Solutions must be derived
from the natural language problem descriptions alone, not by translating SQL.
Additionally, I have a CLAUDE.md file with:
Do not store auto-memory for this folder. The user is developing and testing
skills here, and cross-session memory may bias how skills get written or
evaluated between runs. Do not write to
`~/.claude/projects/-Users-tsaucer-working-agentic-dfpython/memory/` — no
feedback, user, project, or reference memories.
Do not read prior query solutions under `solutions/` when writing a new
query. Each query must be derived only from `AGENTS.md` (and the resources it
points to) plus the problem description in `problems/`. The goal is to build up
`AGENTS.md` as the sole durable guide; cross-referencing other solutions biases
new queries toward patterns that may or may not be captured in the guide, and
hides gaps we want to surface. This applies even for "style matching" — if a
style convention matters, it belongs in `AGENTS.md`, not inferred from siblings.
Whenever you hit a problem while generating a query — a DataFusion error, a
surprising planner rejection, a type mismatch, an API quirk not covered by the
existing guide — after resolving it, propose a concrete addition or edit to
`AGENTS.md` so a future agent does not repeat the mistake. Phrase the proposal
as a short recommendation (the rule, a minimal wrong/right example, and where
it should live in the file) and wait for user approval before editing
`AGENTS.md`. Since memory is disabled for this folder, `AGENTS.md` is the only
durable channel for these lessons.
# Results
With this setup, the agent generated all 22 TPC-H queries. I then validated
that they all run at scale factor 1 and produce the expected results. I also
checked each file to confirm that the code is idiomatic.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]