asf-tooling commented on issue #882:
URL:
https://github.com/apache/tooling-trusted-releases/issues/882#issuecomment-4407525322
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@751c2146`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `voting`, `authentication_authorization`
### Summary
Every vote tabulation fetches the full email-to-UID mapping from LDAP via
`util.email_to_uid_map()`, called in `atr/tabulate.py::votes()`. The team has
agreed (per @sbp and @alitheg on 2026-05-04) to cache a JSON mapping of hashed
email addresses to ASF UIDs on disk (in a `secrets/cached` directory), with the
background worker producing the cache and the main process (tabulate function)
consuming it. @sbp confirmed the cache will be erased on shutdown.
### Where this lives in the code today
#### `atr/tabulate.py` — `votes` (lines 201-216)
_needs modification_
This is the call site that triggers an LDAP query for all emails on every
vote tabulation; it should use a cached mapping instead.
```python
async def votes( # noqa: C901
committee: sql.Committee | None,
thread_id: str,
*,
excluded_message_ids: set[str] | None = None,
) -> tuple[int | None, dict[str, models.tabulate.VoteEmail]]:
"""Tabulate votes."""
    excluded_message_ids_normalized = {
        _message_id_normalize(message_id) for message_id in (excluded_message_ids or set()) if message_id
    }
start = time.perf_counter_ns()
email_to_uid = await util.email_to_uid_map()
end = time.perf_counter_ns()
log.info(f"LDAP search took {(end - start) / 1000000} ms")
log.info(f"Email addresses from LDAP: {len(email_to_uid)}")
```
#### `atr/cache.py` — `AdminsCache` (lines 39-59)
_extension point_
Existing file-based caching pattern (AdminsCache) that can be replicated for
the email-to-UID cache, with similar read-from-file and refresh loop mechanics.
```python
class AdminsCache(schema.Strict):
    refreshed: datetime.datetime = schema.description("When the cache was last refreshed")
    admins: frozenset[str] = schema.description("Set of admin user IDs from LDAP")


def admins_get() -> frozenset[str]:
    if asfquart.APP is not None:
        return asfquart.APP.extensions.get("admins", frozenset())
    cache_data = _admins_read_from_file()
    if cache_data is None:
        return frozenset()
    return cache_data.admins


async def admins_get_async() -> frozenset[str]:
    if asfquart.APP is not None:
        return asfquart.APP.extensions.get("admins", frozenset())
    cache_data = await _admins_read_from_file_async()
    if cache_data is None:
        return frozenset()
    return cache_data.admins
```
#### `atr/cache.py` — `_admins_path` (lines 131-132)
_extension point_
Shows the existing pattern for cache file location; the email cache would
use a similar path but in `secrets/cached` as discussed.
```python
def _admins_path() -> pathlib.Path:
return pathlib.Path(config.get().STATE_DIR) / "cache" / "admins.json"
```
#### `atr/cache.py` — `admins_refresh_loop` (lines 62-71)
_extension point_
Existing refresh loop pattern; per @alitheg, the worker process would
similarly produce the email cache on disk periodically.
```python
async def admins_refresh_loop() -> None:
while True:
await asyncio.sleep(ADMINS_POLL_INTERVAL_SECONDS)
try:
users = await ldap.fetch_admin_users()
await admins_save_to_file(users)
_admins_update_app_extensions(users)
log.info(f"Admin users cache refreshed: {len(users)} users")
except Exception as e:
log.warning(f"Admin refresh failed: {e}")
```
#### `atr/ldap.py` — `_search_core` (lines 379-398)
_currently does this_
LDAP search infrastructure used by email_to_uid_map; expensive to call
repeatedly for all users.
```python
def _search_core(params: SearchParameters) -> None:
params.results_list = []
params.err_msg = None
params.srv_info = None
params.detail_err = None
params.connection = None
    server = ldap3.Server(LDAP_SERVER_HOST, use_ssl=True, tls=_tls_config, get_info=ldap3.ALL)
params.srv_info = repr(server)
if params.bind_dn_from_config and params.bind_password_from_config:
params.connection = ldap3.Connection(
server,
user=params.bind_dn_from_config,
password=params.bind_password_from_config,
auto_bind=True,
check_names=False,
)
else:
        params.connection = ldap3.Connection(server, auto_bind=True, check_names=False)
```
### Where new code would go
- `atr/cache.py` — after the symbol `_project_version_update_app_extensions`:
  add the email-UID cache functions following the existing pattern of
  AdminsCache and the project_version cache, including the model, path,
  read, write, refresh loop, and startup functions.
- `atr/server.py` — after the startup hooks for the other caches:
  register the email cache refresh loop as a background task, and add a
  shutdown hook to erase the cache file.
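Since `atr/server.py` was not examined, the exact registration hooks are unknown. As a plain-asyncio sketch of the intended lifecycle (the names and the temp path are hypothetical stand-ins), the refresh loop runs as a cancellable background task, and the shutdown path cancels it and erases the file:

```python
import asyncio
import contextlib
import pathlib
import tempfile

# Hypothetical stand-in for the secrets/cached/email_uid.json path
CACHE_PATH = pathlib.Path(tempfile.gettempdir()) / "email_uid_demo_cache.json"

async def refresh_loop(interval: float) -> None:
    """Periodically rewrite the cache file (placeholder for email_uid_refresh())."""
    while True:
        CACHE_PATH.write_text("{}")
        await asyncio.sleep(interval)

async def main() -> None:
    # Startup: run the refresh loop as a background task
    task = asyncio.create_task(refresh_loop(0.01))
    await asyncio.sleep(0.05)  # ... serve requests ...
    # Shutdown: cancel the loop, then erase the on-disk cache
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    CACHE_PATH.unlink(missing_ok=True)

asyncio.run(main())
print(CACHE_PATH.exists())  # False
```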
### Proposed approach
Following the team's agreed design (@sbp and @alitheg): (1) Add an on-disk
cache of hashed-email-to-UID mappings in a `secrets/cached` directory under
STATE_DIR. The cache stores a JSON file mapping SHA-256 hashes of lowercase
email addresses to ASF UIDs, refreshed periodically by a background worker
task. (2) Add lookup helpers, `email_uid_lookup()` and `email_uid_get_all()`,
in `atr/cache.py` that read from the file on disk (no in-process memory is
needed since the main process and workers are separate). (3) Modify
`atr/tabulate.py::votes()` to try the cached
mapping first, falling back to a live LDAP query if the cache is missing or
stale. (4) Register a refresh loop in the worker/manager and add a shutdown
hook to erase the cache. This depends on #872 for cache invalidation
notifications. The hashing of emails is a privacy measure since the cache is on
disk.
### Suggested patches
#### `atr/cache.py`
Add email-to-UID cache with hashed emails, file-based persistence, refresh
loop, and cleanup
````diff
--- a/atr/cache.py
+++ b/atr/cache.py
@@ -1,5 +1,6 @@
import asyncio
import datetime
+import hashlib
import pathlib
from typing import Final
@@ -35,6 +36,9 @@
# Fifth prime after 3600
ADMINS_POLL_INTERVAL_SECONDS: Final[int] = 3631
+# Refresh interval for email-to-UID cache
+EMAIL_UID_POLL_INTERVAL_SECONDS: Final[int] = 3617
+
PROJECT_VERSION_POLL_INTERVAL_SECONDS: Final[int] = 307
@@ -42,6 +46,12 @@
     admins: frozenset[str] = schema.description("Set of admin user IDs from LDAP")
+
+
+class EmailUidCache(schema.Strict):
+    refreshed: datetime.datetime = schema.description("When the cache was last refreshed")
+    # Mapping of SHA-256 hashed (lowercase) email to ASF UID
+    mapping: dict[str, str] = schema.description("Hashed email to ASF UID mapping")
+
+
def admins_get() -> frozenset[str]:
if asfquart.APP is not None:
return asfquart.APP.extensions.get("admins", frozenset())
@@ -49,6 +59,72 @@
return cache_data.admins
+def email_uid_hash(email: str) -> str:
+ """Hash an email address for use as a cache key."""
+ return hashlib.sha256(email.lower().encode()).hexdigest()
+
+
+def email_uid_lookup(email: str) -> str | None:
+ """Look up an ASF UID by email address from the on-disk cache."""
+ cache_data = _email_uid_read_from_file()
+ if cache_data is None:
+ return None
+ hashed = email_uid_hash(email)
+ return cache_data.mapping.get(hashed)
+
+
+def email_uid_get_all() -> dict[str, str] | None:
+    """Get the full hashed-email-to-UID mapping, or None if the cache is unavailable.
+
+    Note: keys are SHA-256 hashes of lowercase emails, not plain addresses.
+    Callers that need to look up by plain email should use email_uid_lookup(),
+    which hashes the address before the lookup.
+    """
+    cache_data = _email_uid_read_from_file()
+    if cache_data is None:
+        return None
+    return cache_data.mapping
+
+
+async def email_uid_refresh() -> None:
+ """Fetch all emails from LDAP and save hashed mapping to disk."""
+ import atr.util as util
+
+ # util.email_to_uid_map() returns dict[str, str] of email -> uid
+ email_to_uid = await util.email_to_uid_map()
+ hashed_mapping: dict[str, str] = {}
+ for email, uid in email_to_uid.items():
+ hashed_mapping[email_uid_hash(email)] = uid
+ await email_uid_save_to_file(hashed_mapping)
+ log.info(f"Email-to-UID cache refreshed: {len(hashed_mapping)} entries")
+
+
+async def email_uid_refresh_loop() -> None:
+ while True:
+ await asyncio.sleep(EMAIL_UID_POLL_INTERVAL_SECONDS)
+ try:
+ await email_uid_refresh()
+ except Exception as e:
+ log.warning(f"Email-to-UID cache refresh failed: {e}")
+
+
+async def email_uid_save_to_file(mapping: dict[str, str]) -> None:
+ cache_path = _email_uid_path()
+ cache_path.parent.mkdir(parents=True, exist_ok=True)
+    cache_data = EmailUidCache(refreshed=datetime.datetime.now(datetime.UTC), mapping=mapping)
+ async with aiofiles.open(cache_path, "w") as f:
+ await f.write(cache_data.model_dump_json())
+
+
+async def email_uid_startup_load() -> None:
+ cache_data = _email_uid_read_from_file()
+ if cache_data is not None:
+        log.info(f"Loaded email-to-UID cache: {len(cache_data.mapping)} entries (refreshed: {cache_data.refreshed})")
+ return
+ log.info("No email-to-UID cache found, fetching from LDAP")
+ try:
+ await email_uid_refresh()
+ except Exception as e:
+        log.warning(f"Failed to populate email-to-UID cache at startup: {e}")
+
+
+def email_uid_erase() -> None:
+ """Erase the email-to-UID cache file. Called on shutdown."""
+ cache_path = _email_uid_path()
+ try:
+ cache_path.unlink(missing_ok=True)
+ log.info("Email-to-UID cache erased on shutdown")
+ except OSError as e:
+ log.warning(f"Failed to erase email-to-UID cache: {e}")
+
+
async def admins_get_async() -> frozenset[str]:
if asfquart.APP is not None:
return asfquart.APP.extensions.get("admins", frozenset())
@@ -118,6 +194,28 @@
def _admins_path() -> pathlib.Path:
return pathlib.Path(config.get().STATE_DIR) / "cache" / "admins.json"
+
+def _email_uid_path() -> pathlib.Path:
+    return pathlib.Path(config.get().STATE_DIR) / "secrets" / "cached" / "email_uid.json"
+
+
+def _email_uid_read_from_file() -> EmailUidCache | None:
+ cache_path = _email_uid_path()
+ if not cache_path.exists():
+ return None
+ try:
+ with open(cache_path) as f:
+ raw_data = f.read()
+ except OSError as e:
+ log.warning(f"Failed to read email-to-UID cache: {e}")
+ return None
+ try:
+ return EmailUidCache.model_validate_json(raw_data)
+ except pydantic.ValidationError as e:
+ log.warning(f"Failed to parse email-to-UID cache: {e}")
+ return None
+
+
def _admins_read_from_file() -> AdminsCache | None:
cache_path = _admins_path()
if not cache_path.exists():
````
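One refinement to the `email_uid_save_to_file` sketch above: since the main process reads the file that the worker writes, the write should be atomic so a reader never sees a half-written JSON document. The usual pattern is write-to-temp-then-rename (`os.replace` is atomic on POSIX). A stdlib sketch, not existing atr code; the path and payload shape are illustrative:

```python
import json
import os
import pathlib
import tempfile

def write_cache_atomically(path: pathlib.Path, mapping: dict[str, str]) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    os.replace() is atomic on POSIX, so a concurrent reader sees either the
    old complete file or the new complete file, never a partial write.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"mapping": mapping}, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

target = pathlib.Path(tempfile.gettempdir()) / "email_uid_demo.json"
write_cache_atomically(target, {"ab12": "dev"})
print(json.loads(target.read_text())["mapping"])  # {'ab12': 'dev'}
```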
#### `atr/tabulate.py`
Use cached email-to-UID mapping with fallback to live LDAP query
````diff
--- a/atr/tabulate.py
+++ b/atr/tabulate.py
@@ -22,6 +22,7 @@
from collections.abc import Generator
from typing import Protocol
+import atr.cache as cache
import atr.config as config
import atr.db as db
import atr.log as log
@@ -180,9 +181,17 @@
         _message_id_normalize(message_id) for message_id in (excluded_message_ids or set()) if message_id
}
start = time.perf_counter_ns()
-    email_to_uid = await util.email_to_uid_map()
+    # Try the on-disk cache first; fall back to a live LDAP query.
+    # Either way the resulting keys are SHA-256 hashes of lowercase emails,
+    # so _vote_identity() must hash its lookup key (see the open questions).
+    cached_mapping = cache.email_uid_get_all()
+    if cached_mapping is not None:
+        email_to_uid = cached_mapping
+    else:
+        live_map = await util.email_to_uid_map()
+        email_to_uid = {cache.email_uid_hash(email): uid for email, uid in live_map.items()}
end = time.perf_counter_ns()
log.info(f"LDAP search took {(end - start) / 1000000} ms")
log.info(f"Email addresses from LDAP: {len(email_to_uid)}")
````
### Open questions
- The hashing requirement creates a tension: the tabulate `_vote_identity()`
function needs to look up by plain email to find the UID, but the cache stores
hashed emails. Either the cache needs to store plain emails (less private but
simpler), or `_vote_identity()` needs to be refactored to hash the email before
lookup. The team should clarify whether the hashing is for at-rest privacy or
if a plain-email cache is acceptable.
- I don't have access to `atr/util.py` to see the implementation of
`email_to_uid_map()` — the diff for `atr/tabulate.py` depends on understanding
how the lookup works and whether the interface can be adapted to a hashed cache.
- Issue #872 is listed as a dependency for cache invalidation notifications
— is that issue resolved? The refresh loop approach works without it, but
real-time invalidation would require it.
- Where exactly should the shutdown hook (calling `email_uid_erase()`) be
registered? Need to see `atr/server.py` shutdown handling.
- Should the worker process (separate from the main web process) run the
refresh loop, or should the main process do it? @alitheg suggested the worker
produces it and the main process consumes it, which aligns with the file-based
approach but requires the worker to have its own refresh scheduling.
### Files examined
- `atr/cache.py`
- `atr/ldap.py`
- `atr/principal.py`
- `atr/tabulate.py`
- `tests/unit/test_ldap.py`
- `atr/get/resolve.py`
- `atr/storage/writers/vote.py`
- `atr/tasks/vote.py`
### Related issues
This issue appears related to: #873.
_Both involve LDAP caching and session handling improvements, with #882
depending on #873's pubsub notification mechanism_
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]