asf-tooling commented on issue #1247:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/1247#issuecomment-4495896998

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@ab610b23`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `voting_and_resolution`, 
`notification_and_messaging`
   
   ### Summary
   The issue requests improving voter identification during vote tabulation by 
using the SPF envelope-from value from `Received-SPF` headers (available via 
the lists.apache.org `source.lua` API). This would be a more reliable 
alternative to the current heuristic, which relies on the CC header, a field 
that the `email.lua` JSON API redacts. The key function that needs modification 
is `_vote_identity` in `atr/tabulate.py`, and the message-fetching layer 
(likely `util.thread_messages` in `atr/util.py`) needs to supply the new header 
data, either by switching to `source.lua` or supplementing with it.
   
   ### Where this lives in the code today
   
   #### `atr/tabulate.py` — `_vote_identity` (lines 434-453)
   _needs modification_
   This is the core function that determines voter identity from email headers; 
it currently uses CC as the fallback heuristic when the From header shows a 
forwarded 'via' message, and needs to be updated to prefer SPF envelope-from.
   
   ```python
   def _vote_identity(
       from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str]
   ) -> tuple[bool, str, str, str | None]:
       from_email_lower = util.email_from_uid(from_raw)
       if not from_email_lower:
           return False, "", "", None
       name = _name_from_raw(from_raw)
       from_email_lower = from_email_lower.removesuffix(".invalid")
       asf_uid = None
       if from_email_lower.endswith("@apache.org"):
           asf_uid = from_email_lower.split("@")[0]
       else:
           if ("via" in from_raw) and (from_email_lower.replace("@", ".") in list_email):
               # Take the last CC, appended by ezmlm, and use that as the email. Otherwise, use their name
               if cc:
                   from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
           if from_email_lower in email_to_uid:
               asf_uid = email_to_uid[from_email_lower]
   
       return True, name, from_email_lower, asf_uid
   ```
   
   #### `atr/shared/vote.py` — `message_id_source_archive_url` (lines 25-32)
   _currently does this_
   Shows that the codebase already knows about the source.lua API endpoint; 
this same endpoint (or a thread-level equivalent) is what provides unredacted 
headers including Received-SPF.
   
   ```python
   def message_id_source_archive_url(message_id: str, vote_recipient: str) -> str:
       list_id = vote_recipient.replace("@", ".")
       query = urllib.parse.urlencode(
           {"id": f"<{message_id}>", "listid": f"<{list_id}>"},
           quote_via=urllib.parse.quote,
           safe="@",
       )
       return f"https://lists.apache.org/api/source.lua?{query}"
   ```
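
To make the endpoint concrete, the following sketch repeats the helper so that it runs on its own and prints the URL it builds. The Message-ID and list address are hypothetical sample values, not taken from a real thread.

```python
import urllib.parse


def message_id_source_archive_url(message_id: str, vote_recipient: str) -> str:
    # Same logic as the excerpt above, repeated so the example is self-contained.
    list_id = vote_recipient.replace("@", ".")
    query = urllib.parse.urlencode(
        {"id": f"<{message_id}>", "listid": f"<{list_id}>"},
        quote_via=urllib.parse.quote,
        safe="@",
    )
    return f"https://lists.apache.org/api/source.lua?{query}"


# Hypothetical Message-ID and list address, for illustration only.
url = message_id_source_archive_url("abc123@example.org", "dev@project.apache.org")
print(url)
```

The angle brackets are percent-encoded as `%3C` and `%3E`, while `@` is left as-is because of `safe="@"`.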
   
   #### `tests/unit/test_tabulate.py` — `_vote_message` (lines 227-245)
   _needs modification_
   Test helper that constructs mock messages; will need to include an optional 
received_spf field to test the new behavior.
   
   ```python
   def _vote_message(
       *,
       from_raw: str,
       message_id: str,
       archive_mid: str,
       epoch: str,
   ) -> dict[str, object]:
       return {
           "from_raw": from_raw,
           "list_raw": "dev.project.apache.org",
           "cc": "",
           "epoch": epoch,
           "subject": "Re: [VOTE] Release project 1.0.0",
           "body": "+1\n",
           "message-id": message_id,
           "mid": archive_mid,
           "date": "2026-01-01T00:00:00Z",
           "id": f"doc-{epoch}",
       }
   ```
   
   ### Where new code would go
   - `atr/tabulate.py` — after symbol `_name_from_raw`
     Add a new helper function `_extract_envelope_from` to parse the 
envelope-from parameter from a `Received-SPF` header value.
   - `atr/util.py` — near the `thread_messages` function
     The message-fetching layer needs to be updated to retrieve data from 
`source.lua` (which provides unredacted headers) instead of or in addition to 
`email.lua`. This is where the `Received-SPF` header would be parsed from the 
raw email source.
   
   ### Proposed approach
   The change has two parts:
   
   1. **Message data enrichment**: The `util.thread_messages` function (in 
`atr/util.py`, which I don't have source for) currently fetches messages from 
lists.apache.org using what appears to be the `email.lua` JSON API. It needs to 
be changed to use `source.lua` (or supplement with it) to get unredacted CC 
fields and Received-SPF headers. The raw email source from `source.lua` would 
need to be parsed to extract these headers.
   
   2. **Voter identity logic update**: In `atr/tabulate.py`, the 
`_vote_identity` function needs a new parameter for SPF envelope-from data. 
When a message is forwarded via ezmlm (detected by 'via' in from_raw), the 
function should first try to extract the real sender from the `Received-SPF` 
envelope-from, and only fall back to the CC heuristic if no SPF data is 
available. A new helper `_extract_envelope_from` should parse the envelope-from 
parameter from the Received-SPF header string using a regex.
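
The header extraction described in part 1 could be sketched with Python's standard `email` module. This is a minimal sketch, assuming that `source.lua` returns the raw RFC 5322 message text, which is still an open question. The names `SourceHeaders`, `spf_envelope_from`, and `headers_from_source`, the choice to use the first `Received-SPF` header that carries an envelope-from, and the sample message are all hypothetical and not part of the ATR codebase.

```python
import email
import email.policy
import email.utils
import re
from dataclasses import dataclass

# RFC 7208 allows the envelope-from value to be a quoted string, so the
# pattern accepts either a quoted value or a bare token.
_ENVELOPE_FROM_RE = re.compile(r'envelope-from=("[^"]*"|[^;\s]+)', re.IGNORECASE)


@dataclass
class SourceHeaders:
    """Hypothetical container for the headers the tabulation code needs."""

    envelope_from: str
    cc: list[str]


def spf_envelope_from(received_spf: str) -> str:
    """Return the lowercase envelope-from address, or "" if there is none."""
    match = _ENVELOPE_FROM_RE.search(received_spf)
    if not match:
        return ""
    # Drop surrounding quotes or angle brackets before normalising case.
    return match.group(1).strip('"<>').lower()


def headers_from_source(raw_source: str) -> SourceHeaders:
    """Parse raw message text and pull out the SPF envelope-from and Cc."""
    # policy.default unfolds multi-line header values when they are read.
    message = email.message_from_string(raw_source, policy=email.policy.default)
    envelope_from = ""
    # Assumption: the first Received-SPF header with an envelope-from wins.
    for value in message.get_all("Received-SPF", []):
        envelope_from = spf_envelope_from(str(value))
        if envelope_from:
            break
    cc_values = [str(value) for value in message.get_all("Cc", [])]
    cc = [addr.lower() for _name, addr in email.utils.getaddresses(cc_values) if addr]
    return SourceHeaders(envelope_from=envelope_from, cc=cc)


# Hypothetical raw message with a folded Received-SPF header.
SAMPLE = (
    "Received-SPF: pass (mx.example.org: domain of voter@example.com designates\n"
    " 192.0.2.1 as permitted sender) client-ip=192.0.2.1;\n"
    ' envelope-from="Voter@Example.com"; helo=mail.example.com;\n'
    "From: Voter via dev <dev@project.example.org>\n"
    "Cc: Someone <someone@example.net>\n"
    "Subject: Re: [VOTE] Release project 1.0.0\n"
    "\n"
    "+1\n"
)

headers = headers_from_source(SAMPLE)
print(headers.envelope_from, headers.cc)
```

Which `Received-SPF` header is authoritative when a message carries several is a design decision this sketch does not settle.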
   
   ### Suggested patches
   
   #### `atr/tabulate.py`
   Add envelope-from extraction helper and update _vote_identity to prefer SPF 
envelope-from over CC heuristic
   
   ````diff
   --- a/atr/tabulate.py
   +++ b/atr/tabulate.py
   @@ -195,8 +195,9 @@
            from_raw = msg.get("from_raw", "")
            list_raw = msg.get("list_raw", "")
            cc = msg.get("cc", "").split(",\n")
   -        ok, name, from_email_lower, asf_uid = _vote_identity(from_raw, email_to_uid, list_raw, cc)
   +        received_spf = msg.get("received_spf", "")
   +        ok, name, from_email_lower, asf_uid = _vote_identity(from_raw, email_to_uid, list_raw, cc, received_spf)
            if not ok:
                continue
    
   @@ -260,10 +261,26 @@
        return name.strip()
    
    
   +_ENVELOPE_FROM_RE = re.compile(r"envelope-from=([^;\s]+)", re.IGNORECASE)
   +
   +
   +def _extract_envelope_from(received_spf: str) -> str:
   +    """Extract the envelope-from email address from a Received-SPF header value.
   +
   +    Returns the lowercase address, minus any surrounding quotes or <>, or "" if not found.
   +    """
   +    if not received_spf:
   +        return ""
   +    match = _ENVELOPE_FROM_RE.search(received_spf)
   +    if match:
   +        return match.group(1).strip().strip("\"<>").lower()
   +    return ""
   +
   +
    def _vote_identity(
   -    from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str]
   +    from_raw: str, email_to_uid: cache.EmailUidLookup, list_email: str, cc: list[str], received_spf: str = ""
    ) -> tuple[bool, str, str, str | None]:
        from_email_lower = util.email_from_uid(from_raw)
        if not from_email_lower:
            return False, "", "", None
        name = _name_from_raw(from_raw)
        from_email_lower = from_email_lower.removesuffix(".invalid")
        asf_uid = None
        if from_email_lower.endswith("@apache.org"):
            asf_uid = from_email_lower.split("@")[0]
        else:
            if ("via" in from_raw) and (from_email_lower.replace("@", ".") in list_email):
   -            # Take the last CC, appended by ezmlm, and use that as the email. Otherwise, use their name
   -            if cc:
   -                from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
   +            # Prefer SPF envelope-from over the CC heuristic
   +            envelope_from = _extract_envelope_from(received_spf)
   +            if envelope_from:
   +                from_email_lower = envelope_from
   +            elif cc:
   +                # Fall back to the last CC, appended by ezmlm
   +                from_email_lower = util.email_from_uid(cc[-1]) or from_email_lower
            if from_email_lower in email_to_uid:
                asf_uid = email_to_uid[from_email_lower]
    
        return True, name, from_email_lower, asf_uid
   ````
   
   #### `tests/unit/test_tabulate.py`
   Add test for the new _extract_envelope_from helper and for SPF-based voter 
identity resolution
   
   ````diff
   --- a/tests/unit/test_tabulate.py
   +++ b/tests/unit/test_tabulate.py
   @@ -147,6 +147,42 @@
        assert body_lines[2] == "Of these formal votes, 3 were +1, 0 were -1, and 0 were 0."
    
    
   +def test_extract_envelope_from_returns_email() -> None:
   +    spf = "pass (domain of [email protected] designates 1.2.3.4 as permitted sender) [email protected];"
   +    assert tabulate._extract_envelope_from(spf) == "[email protected]"
   +
   +
   +def test_extract_envelope_from_returns_empty_on_missing() -> None:
   +    assert tabulate._extract_envelope_from("") == ""
   +    assert tabulate._extract_envelope_from("pass (no envelope)") == ""
   +
   +
   +def test_vote_identity_prefers_spf_over_cc() -> None:
   +    """When SPF envelope-from is available, it should be preferred over CC."""
   +    lookup = tabulate.cache.EmailUidLookup(
   +        {
   +            tabulate.cache._email_uid_hash("[email protected]"): "realuser",
   +        }
   +    )
   +    ok, name, from_email, asf_uid = tabulate._vote_identity(
   +        from_raw='"User via List" <dev.project.apache.org>',
   +        email_to_uid=lookup,
   +        list_email="dev.project.apache.org",
   +        cc=["[email protected]"],
   +        received_spf="pass [email protected];",
   +    )
   +    assert ok is True
   +    assert from_email == "[email protected]"
   +    assert asf_uid == "realuser"
   +
   +
   +def test_vote_identity_falls_back_to_cc_without_spf() -> None:
   +    """When no SPF envelope-from is available, fall back to CC."""
   +    lookup = tabulate.cache.EmailUidLookup({})
   +    ok, _name, from_email, _asf_uid = tabulate._vote_identity(
   +        from_raw='"User via List" <dev.project.apache.org>',
   +        email_to_uid=lookup,
   +        list_email="dev.project.apache.org",
   +        cc=["[email protected]"],
   +        received_spf="",
   +    )
   +    assert ok is True
   +    assert from_email == "[email protected]"
   +
   +
    @pytest.mark.asyncio
    async def test_votes_excludes_receipts_by_rfc_message_id_only(monkeypatch: pytest.MonkeyPatch) -> None:
   ````
   
   ### Open questions
   - How does `util.thread_messages` currently fetch data from 
lists.apache.org? Is it using email.lua JSON API? The source for atr/util.py 
was not provided, so the exact change to switch to or supplement with 
source.lua is unclear.
   - What is the exact format of the source.lua response? Is it raw RFC 2822 
email text that needs to be parsed with Python's email module, or is it 
structured differently?
   - Does the thread-level API (used for iterating all messages in a thread) 
also have a source.lua equivalent, or do we need to fetch each message 
individually via source.lua?
   - Should the system parse the full raw email source to extract Received-SPF, 
or should it only fetch source.lua for messages that appear to be forwarded via 
ezmlm (optimization)?
   - The test helper `_vote_message` and the actual message fetching need to 
include `received_spf` - what's the exact key name that the util layer will use 
when it parses source.lua responses?
   - How should `cache.EmailUidLookup` and `cache._email_uid_hash` be used in 
the test? The exact API for constructing test lookups needs to be verified.
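
The optimization raised in the open questions (fetching `source.lua` only for messages that look forwarded) could be sketched as a small predicate that mirrors the `via` check in `_vote_identity`. This is an illustrative sketch: `needs_source_lookup` is a hypothetical name, and it uses `email.utils.parseaddr` in place of `util.email_from_uid`, whose exact behavior is not shown here.

```python
from email.utils import parseaddr


def needs_source_lookup(from_raw: str, list_raw: str) -> bool:
    """Return True when the From header looks like a list-rewritten sender."""
    # Stand-in for util.email_from_uid: take the address part of the header.
    addr = parseaddr(from_raw)[1].lower().removesuffix(".invalid")
    if not addr or addr.endswith("@apache.org"):
        # Missing senders and direct apache.org senders need no extra fetch.
        return False
    # Same condition as the forwarded-message branch in _vote_identity.
    return ("via" in from_raw) and (addr.replace("@", ".") in list_raw)


# Hypothetical messages shaped like the dictionaries used in the unit tests.
messages = [
    {"from_raw": '"Voter via dev" <dev@project.apache.org>', "list_raw": "<dev.project.apache.org>"},
    {"from_raw": "Committer <committer@apache.org>", "list_raw": "<dev.project.apache.org>"},
    {"from_raw": "Someone <someone@example.com>", "list_raw": "<dev.project.apache.org>"},
]
to_fetch = [m for m in messages if needs_source_lookup(m["from_raw"], m["list_raw"])]
print(len(to_fetch))
```

Only the first sample message would trigger an extra request, which keeps the number of `source.lua` fetches proportional to the number of forwarded votes rather than the thread size.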
   
   ### Files examined
   - `atr/models/tabulate.py`
   - `atr/tabulate.py`
   - `atr/get/resolve.py`
   - `tests/unit/test_tabulate.py`
   - `atr/mail.py`
   - `atr/shared/vote.py`
   - `atr/storage/writers/vote.py`
   - `tests/unit/test_vote.py`
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

