andreahlert opened a new issue, #327:
URL: https://github.com/apache/airflow-steward/issues/327

   # `docs/mode-economics.md`: provenance, tokenizer disclosure, and 
measurement methodology
   
   ## Context
   
   `docs/mode-economics.md` (added in #253) is positioned by [MISSION.md § 
Affordability](../MISSION.md#affordability-and-vendor-neutrality--the-public-good-commitment)
 as quantitative input for adoption decisions and for the long-term ASF 
inference-endpoint capacity plan. #281 then wired the contributor expectation 
to keep the page in sync with skill changes (`AGENTS.md § Keeping evals and 
mode-economics in sync`).
   
   That sequence resolves the **future** sync gap. It does not address the 
**shipped** state of the page, which still presents bot-generated ranges as if 
they were measured, with no provenance signal. #326 fixed three concrete bugs 
in passing; this issue collects the structural items I deliberately kept out of 
that PR so the project can decide direction before any further edit lands.
   
   The framing I'd suggest: every concern below is a low-cost annotation or 
table-shape change. None of them require running new evals; all of them become 
easier to populate honestly once #281's `--cli` harness has run against a real 
skill suite.
   
   ## Findings
   
   ### 1. No provenance / freshness signal on the page itself
   
   The page reports per-skill / per-mode token counts but never states:
   
   - Which tokenizer the counts are measured with. The corrected SKILL.md 
anchor (lines 57–59) says `measured with cl100k_base`, but the per-mode / 
per-skill tables further down never restate that assumption, and the 
cross-class recommendations (Small ~7B–13B, Mid-tier ~70B, Large frontier) are 
vendor-agnostic — `cl100k_base` is OpenAI-specific and diverges from Claude's 
tokenizer / Llama BPE by ~10–40 % depending on content (code-heavy and 
non-English content drift the most).
   - When the counts were last measured.
   - Which subset of skills is covered. The catalogue is currently ~26 skills; 
the page names ~13.
   
   Suggested fix: a single banner / front-matter block at the top of the page:
   
   > _Measured with `cl100k_base` on `<commit>` / `<date>`. ±~30 % drift 
expected across Claude / Llama / local quantised tokenizers. Coverage: N of M 
skills; remainder use estimated ranges flagged in the tables._
   
   ### 2. Measured rows and hypothetical rows share one table
   
   Two rows in the Pairing section give concrete numeric ranges for skills that 
don't exist on `main`:
   
   - `pairing-self-review` — labeled "Estimated; skill experimental". Per 
#253's own description, this is _"a name in a table, not a link"_ — an unmerged 
sibling branch.
   - `Multi-agent review pipeline` — labeled "Estimated; future skill."
   
   The labeling is visible in the Notes column, but the numeric `Token range` 
column reads as commensurable with the measured Triage / Drafting rows above. A 
reader scanning the tables for budget planning has to spot the "Estimated; 
future" tag to discount the row.
   
   Suggested fix: split the Pairing table into a "Measured" subsection and a 
"Planned / hypothetical (estimates pending implementation)" subsection. Same 
numbers, different headings. Or: replace concrete bounds on hypothetical rows 
with `TBD (expected O(10k–100k))`.
   
   ### 3. Ranges have shape but no anchor
   
   The Drafting ranges span 5–7× (e.g. `30 000–150 000` for 
`security-issue-fix` code-fix; `30 000–200 000` for the multi-agent pipeline). 
The Affordability framing implies a maintainer can use these to plan; in 
practice a 5× spread is ceiling-budgeting only — it doesn't help rank modes or 
estimate typical cost.
   
   Suggested fix: add a `p50` / `typical` column alongside `Token range`, and 
one worked example per mode (e.g. "1-file fix on the issue-fix-workflow corpus 
≈ 35 000; multi-package fix ≈ 130 000"). Keep the range; add the anchor inside 
it. #281's `--cli` runs across the eval suites are the natural data source for 
both columns once they've been exercised.
   
   ### 4. Cross-class multiplier table needs a date and named comparators
   
   `Small-class models are typically 10–50× cheaper per token than Large-class 
models at hosted-API rates. Mid-tier sits at roughly 3–10× cheaper than Large.` 
The structural claim (small cheaper than large by ~1 OoM) is durable; the 
specific ratio drifts as vendors reprice. No date, no named comparators 
(Haiku/Opus? Flash/Pro? GPT-mini/full?).
   
   Suggested fix: one-line caption — `As of YYYY-MM. Indicative ratios across 
hosted-API rates from <vendors named>. Check current pricing pages.`
   
   ### 5. Pairing payer ambiguity
   
   The Pairing section opens with _"runs in the developer's own development 
cycle, not on project infrastructure — cost is per-developer-session"_ but then 
provides a token rule-of-thumb for adoption planning. The page targets _"a 
maintainer evaluating adoption"_, who has to know whether this line hits the 
project budget, the contributor's personal budget, or a hybrid reimbursement 
model.
   
   Suggested fix: one sentence after the "per-developer-session" framing — 
`Whether the project reimburses contributors or treats Pairing as personal 
tooling is a project-level policy decision, outside the scope of this page.`
   
   ### 6. Auto-merge subsection is empty scaffolding
   
   The Auto-merge subsection contains exactly two sentences: `**Status: off.** 
Auto-merge is not implemented; it has no token cost.` plus a cross-ref. A full 
H3 with a "Status: off" bolded line mimics the visual weight of the substantive 
Triage / Drafting / Mentoring / Pairing sections.
   
   Suggested fix: collapse to a one-line entry under a `Modes not yet covered` 
subsection at the end, preserving the `#auto-merge` anchor for inbound 
deep-links from `modes.md`.
   
   ## What this issue is **not**
   
   - Not a request to retract the page. The page provides a useful shape and an 
Affordability-mandate scaffolding that should stay.
   - Not a request to wait for perfect data before annotating. Items 1, 4, 5, 6 
are all editorial; only items 2 and 3 benefit from #281's data.
   - Not adversarial to #253 or #281. #281's harness is the right answer for 
sync; this issue is about the in-page presentation that #281 doesn't touch.
   
   ## Suggested sequencing
   
   1. Editorial pass (items 1, 4, 5, 6) — one PR, no new measurements.
   2. Once `--cli` has run against the real skill suites (the open follow-up in 
#281's test plan), backfill items 2 and 3 with measured `p50` + split 
measured/planned tables.
   
   Happy to take the editorial PR if useful; flagging here first so direction 
can be discussed before I touch the file again.
   
   cc @justinmclean @potiuk @choo121600
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to