asf-tooling commented on issue #1247:
URL:
https://github.com/apache/tooling-trusted-releases/issues/1247#issuecomment-4495896998
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@ab610b23`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `voting_and_resolution`,
`notification_and_messaging`
### Summary
The issue requests more reliable voter identification during vote tabulation.
Today the tabulator falls back to a CC header heuristic, but the `email.lua`
JSON API redacts the CC header. The proposal is to use the envelope-from field
of the `Received-SPF` header instead, which the lists.apache.org `source.lua`
API exposes. The key function to modify is `_vote_identity` in
`atr/tabulate.py`. The message-fetching layer (likely `util.thread_messages`
in `atr/util.py`) must also supply the new header data, either by switching to
`source.lua` or by supplementing `email.lua` with it.
### Where this lives in the code today
#### `atr/tabulate.py` — `_vote_identity` (lines 434-453)
_needs modification_
This is the core function that determines voter identity from email headers;
it currently uses CC as the fallback heuristic when the From header shows a
forwarded 'via' message, and needs to be updated to prefer SPF envelope-from.
```python
def _vote_identity(
    from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str]
) -> tuple[bool, str, str, str | None]:
    from_email_lower = util.email_from_uid(from_raw)
    if not from_email_lower:
        return False, "", "", None
    name = _name_from_raw(from_raw)
    from_email_lower = from_email_lower.removesuffix(".invalid")
    asf_uid = None
    if from_email_lower.endswith("@apache.org"):
        asf_uid = from_email_lower.split("@")[0]
    else:
        if ("via" in from_raw) and (from_email_lower.replace("@", ".") in list_email):
            # Take the last CC, appended by ezmlm, and use that as the email. Otherwise, use their name
            if cc:
                from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
        if from_email_lower in email_to_uid:
            asf_uid = email_to_uid[from_email_lower]
    return True, name, from_email_lower, asf_uid
```
#### `atr/shared/vote.py` — `message_id_source_archive_url` (lines 25-32)
_currently does this_
Shows that the codebase already knows about the source.lua API endpoint;
this same endpoint (or a thread-level equivalent) is what provides unredacted
headers including Received-SPF.
```python
def message_id_source_archive_url(message_id: str, vote_recipient: str) -> str:
    list_id = vote_recipient.replace("@", ".")
    query = urllib.parse.urlencode(
        {"id": f"<{message_id}>", "listid": f"<{list_id}>"},
        quote_via=urllib.parse.quote,
        safe="@",
    )
    return f"https://lists.apache.org/api/source.lua?{query}"
```
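To make the shape of the resulting URL concrete, the sketch below reproduces the helper from the excerpt above and calls it with made-up values. The message ID and list address are hypothetical and chosen only for illustration.

```python
import urllib.parse


def message_id_source_archive_url(message_id: str, vote_recipient: str) -> str:
    # Same logic as the excerpt above: the list ID uses "." in place of "@".
    list_id = vote_recipient.replace("@", ".")
    query = urllib.parse.urlencode(
        {"id": f"<{message_id}>", "listid": f"<{list_id}>"},
        quote_via=urllib.parse.quote,
        safe="@",
    )
    return f"https://lists.apache.org/api/source.lua?{query}"


# Hypothetical inputs, for illustration only.
url = message_id_source_archive_url("abc123@mail.example.org", "dev@project.apache.org")
print(url)
# https://lists.apache.org/api/source.lua?id=%3Cabc123@mail.example.org%3E&listid=%3Cdev.project.apache.org%3E
```

The angle brackets are percent-encoded while `@` is left alone because of `safe="@"`.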
#### `tests/unit/test_tabulate.py` — `_vote_message` (lines 227-245)
_needs modification_
Test helper that constructs mock messages; will need to include an optional
received_spf field to test the new behavior.
```python
def _vote_message(
    *,
    from_raw: str,
    message_id: str,
    archive_mid: str,
    epoch: str,
) -> dict[str, object]:
    return {
        "from_raw": from_raw,
        "list_raw": "dev.project.apache.org",
        "cc": "",
        "epoch": epoch,
        "subject": "Re: [VOTE] Release project 1.0.0",
        "body": "+1\n",
        "message-id": message_id,
        "mid": archive_mid,
        "date": "2026-01-01T00:00:00Z",
        "id": f"doc-{epoch}",
    }
```
### Where new code would go
- `atr/tabulate.py` — after symbol _name_from_raw
  Add a new helper function `_extract_envelope_from` to parse the
  `envelope-from` parameter from a `Received-SPF` header value.
- `atr/util.py` — near thread_messages function
  The message-fetching layer needs to retrieve data from `source.lua` (which
  provides unredacted headers) instead of, or in addition to, `email.lua`.
  This is where the `Received-SPF` header would be parsed from the raw
  email source.
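A minimal sketch of the header parsing step follows. It assumes that `source.lua` returns the raw message text, which has not been confirmed. The function name `headers_from_raw_source` and the returned keys are hypothetical. Only the standard library `email` package is used.

```python
from __future__ import annotations

import email
import email.policy


def headers_from_raw_source(raw_source: str) -> dict[str, str]:
    """Pull Received-SPF and Cc out of raw message text (hypothetical helper)."""
    msg = email.message_from_string(raw_source, policy=email.policy.default)
    # Message.get returns the first matching header, or the default if absent.
    # Header names are matched case-insensitively.
    received_spf = msg.get("Received-SPF", "")
    cc = msg.get("Cc", "")
    return {"received_spf": str(received_spf), "cc": str(cc)}
```

If a message carries several `Received-SPF` headers, this returns only the first one. `msg.get_all("Received-SPF")` would return all of them, and deciding which one to trust would be a separate design choice.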
### Proposed approach
The change has two parts:
1. **Message data enrichment**: The `util.thread_messages` function (in
`atr/util.py`, which I don't have source for) currently fetches messages from
lists.apache.org using what appears to be the `email.lua` JSON API. It needs to
be changed to use `source.lua` (or supplement with it) to get unredacted CC
fields and Received-SPF headers. The raw email source from `source.lua` would
need to be parsed to extract these headers.
2. **Voter identity logic update**: In `atr/tabulate.py`, the
`_vote_identity` function needs a new parameter for SPF envelope-from data.
When a message is forwarded via ezmlm (detected by 'via' in from_raw), the
function should first try to extract the real sender from the `Received-SPF`
envelope-from, and only fall back to the CC heuristic if no SPF data is
available. A new helper `_extract_envelope_from` should parse the envelope-from
parameter from the Received-SPF header string using a regex.
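The enrichment step in part 1 could be limited to messages that look forwarded, so that most messages need no extra request. The sketch below illustrates that idea under several assumptions: `looks_forwarded`, `enrich_with_spf`, and the injected `fetch_received_spf` callable are all hypothetical names, and the `received_spf` key is not confirmed. The fetcher is passed in so the logic can be exercised without network access.

```python
from __future__ import annotations

from collections.abc import Callable


def looks_forwarded(from_raw: str) -> bool:
    # Simplified stand-in for the "via" test in _vote_identity; the real check
    # also compares the From address against the list address.
    return "via" in from_raw


def enrich_with_spf(
    messages: list[dict[str, str]],
    fetch_received_spf: Callable[[str], str],
) -> list[dict[str, str]]:
    """Return copies of messages, adding received_spf only where it may be needed."""
    enriched = []
    for msg in messages:
        out = dict(msg)  # copy, so the caller's dictionaries are not mutated
        if looks_forwarded(msg.get("from_raw", "")):
            # One extra lookup per forwarded message, keyed by Message-ID.
            out["received_spf"] = fetch_received_spf(msg.get("message-id", ""))
        enriched.append(out)
    return enriched
```

Messages sent directly from an `@apache.org` address would pass through unchanged, which matches the existing branch in `_vote_identity` that never consults CC for them.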
### Suggested patches
#### `atr/tabulate.py`
Add envelope-from extraction helper and update _vote_identity to prefer SPF
envelope-from over CC heuristic
````diff
--- a/atr/tabulate.py
+++ b/atr/tabulate.py
@@ -195,8 +195,9 @@
         from_raw = msg.get("from_raw", "")
         list_raw = msg.get("list_raw", "")
         cc = msg.get("cc", "").split(",\n")
-        ok, name, from_email_lower, asf_uid = _vote_identity(from_raw, email_to_uid, list_raw, cc)
+        received_spf = msg.get("received_spf", "")
+        ok, name, from_email_lower, asf_uid = _vote_identity(from_raw, email_to_uid, list_raw, cc, received_spf)
         if not ok:
             continue
@@ -260,10 +261,26 @@
     return name.strip()
+_ENVELOPE_FROM_RE = re.compile(r"envelope-from=([^;\s]+)", re.IGNORECASE)
+
+
+def _extract_envelope_from(received_spf: str) -> str:
+    """Extract the envelope-from email address from a Received-SPF header value.
+
+    Returns the lowercase email address, or empty string if not found.
+    """
+    if not received_spf:
+        return ""
+    match = _ENVELOPE_FROM_RE.search(received_spf)
+    if match:
+        return match.group(1).strip().lower()
+    return ""
+
+
 def _vote_identity(
-    from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str]
+    from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str], received_spf: str = ""
 ) -> tuple[bool, str, str, str | None]:
     from_email_lower = util.email_from_uid(from_raw)
     if not from_email_lower:
         return False, "", "", None
     name = _name_from_raw(from_raw)
     from_email_lower = from_email_lower.removesuffix(".invalid")
     asf_uid = None
     if from_email_lower.endswith("@apache.org"):
         asf_uid = from_email_lower.split("@")[0]
     else:
         if ("via" in from_raw) and (from_email_lower.replace("@", ".") in list_email):
-            # Take the last CC, appended by ezmlm, and use that as the email. Otherwise, use their name
-            if cc:
-                from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
+            # Prefer SPF envelope-from over the CC heuristic
+            envelope_from = _extract_envelope_from(received_spf)
+            if envelope_from:
+                from_email_lower = envelope_from
+            elif cc:
+                # Fall back to the last CC, appended by ezmlm
+                from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
         if from_email_lower in email_to_uid:
             asf_uid = email_to_uid[from_email_lower]
     return True, name, from_email_lower, asf_uid
````
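The regex in the patch above captures everything up to the next semicolon or whitespace, so a quoted or bracketed value would keep its delimiters. RFC 7208 permits quoted values in `Received-SPF` key-value pairs, and some SPF implementations emit them. The sketch below is a more tolerant variant under that assumption; whether the headers seen on lists.apache.org are quoted has not been checked, and the public name `extract_envelope_from` is used only for illustration.

```python
import re

# key=value, where the value may be bare, double-quoted, or in angle brackets.
# Whitespace around "=" is not handled in this sketch.
_ENVELOPE_FROM_RE = re.compile(r'envelope-from=["<]?([^";>\s]+)', re.IGNORECASE)


def extract_envelope_from(received_spf: str) -> str:
    """Return the lowercased envelope-from address, or "" if none is found."""
    match = _ENVELOPE_FROM_RE.search(received_spf or "")
    return match.group(1).lower() if match else ""


print(extract_envelope_from('pass (ok) envelope-from="Voter@Example.org"; helo=mail'))
# voter@example.org
```

An empty quoted value such as `envelope-from=""` yields an empty string, so the caller would fall back to the CC heuristic.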
#### `tests/unit/test_tabulate.py`
Add test for the new _extract_envelope_from helper and for SPF-based voter
identity resolution
````diff
--- a/tests/unit/test_tabulate.py
+++ b/tests/unit/test_tabulate.py
@@ -147,6 +147,42 @@
     assert body_lines[2] == "Of these formal votes, 3 were +1, 0 were -1, and 0 were 0."
+def test_extract_envelope_from_returns_email() -> None:
+    spf = "pass (domain of [email protected] designates 1.2.3.4 as permitted sender) envelope-from=[email protected];"
+    assert tabulate._extract_envelope_from(spf) == "[email protected]"
+
+
+def test_extract_envelope_from_returns_empty_on_missing() -> None:
+    assert tabulate._extract_envelope_from("") == ""
+    assert tabulate._extract_envelope_from("pass (no envelope)") == ""
+
+
+def test_vote_identity_prefers_spf_over_cc() -> None:
+    """When SPF envelope-from is available, it should be preferred over CC."""
+    lookup = tabulate.cache.EmailUidLookup(
+        {
+            tabulate.cache._email_uid_hash("[email protected]"): "realuser",
+        }
+    )
+    ok, name, from_email, asf_uid = tabulate._vote_identity(
+        from_raw='"User via List" <dev.project.apache.org>',
+        email_to_uid=lookup,
+        list_email="dev.project.apache.org",
+        cc=["[email protected]"],
+        received_spf="pass envelope-from=[email protected];",
+    )
+    assert ok is True
+    assert from_email == "[email protected]"
+    assert asf_uid == "realuser"
+
+
+def test_vote_identity_falls_back_to_cc_without_spf() -> None:
+    """When no SPF envelope-from is available, fall back to CC."""
+    lookup = tabulate.cache.EmailUidLookup({})
+    ok, _name, from_email, _asf_uid = tabulate._vote_identity(
+        from_raw='"User via List" <dev.project.apache.org>',
+        email_to_uid=lookup,
+        list_email="dev.project.apache.org",
+        cc=["[email protected]"],
+        received_spf="",
+    )
+    assert ok is True
+    assert from_email == "[email protected]"
+
+
 @pytest.mark.asyncio
 async def test_votes_excludes_receipts_by_rfc_message_id_only(monkeypatch: pytest.MonkeyPatch) -> None:
````
### Open questions
- How does `util.thread_messages` currently fetch data from
lists.apache.org? Is it using the `email.lua` JSON API? The source for
`atr/util.py` was not provided, so the exact change needed to switch to, or
supplement with, `source.lua` is unclear.
- What is the exact format of the source.lua response? Is it raw RFC 2822
email text that needs to be parsed with Python's email module, or is it
structured differently?
- Does the thread-level API (used for iterating all messages in a thread)
also have a source.lua equivalent, or do we need to fetch each message
individually via source.lua?
- Should the system parse the full raw email source to extract Received-SPF,
or should it only fetch source.lua for messages that appear to be forwarded via
ezmlm (optimization)?
- The test helper `_vote_message` and the actual message fetching need to
include `received_spf` - what's the exact key name that the util layer will use
when it parses source.lua responses?
- How should `cache.EmailUidLookup` and `cache._email_uid_hash` be used in
the test? The exact API for constructing test lookups needs to be verified.
### Files examined
- `atr/models/tabulate.py`
- `atr/tabulate.py`
- `atr/get/resolve.py`
- `tests/unit/test_tabulate.py`
- `atr/mail.py`
- `atr/shared/vote.py`
- `atr/storage/writers/vote.py`
- `tests/unit/test_vote.py`
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]