asf-tooling commented on issue #882:
URL:
https://github.com/apache/tooling-trusted-releases/issues/882#issuecomment-4407525322
<!-- gofannon-issue-triage-bot v2 -->
**Automated triage** — analyzed at `main@751c2146`
**Type:** `new_feature` • **Classification:** `actionable` •
**Confidence:** `medium`
**Application domain(s):** `voting`, `authentication_authorization`
### Summary
Every vote tabulation fetches the full email-to-UID mapping from LDAP via
`util.email_to_uid_map()`, called in `atr/tabulate.py::votes()`. The team has
agreed (per @sbp and @alitheg on 2026-05-04) to cache a JSON mapping of hashed
email addresses to ASF UIDs on disk (in a `secrets/cached` directory), with the
background worker producing the cache and the main process (tabulate function)
consuming it. @sbp confirmed the cache will be erased on shutdown.
### Where this lives in the code today
#### `atr/tabulate.py` — `votes` (lines 201-216)
_needs modification_
This is the call site that triggers an LDAP query for all emails on every
vote tabulation; it should use a cached mapping instead.
```python
async def votes( # noqa: C901
committee: sql.Committee | None,
thread_id: str,
*,
excluded_message_ids: set[str] | None = None,
) -> tuple[int | None, dict[str, models.tabulate.VoteEmail]]:
"""Tabulate votes."""
    excluded_message_ids_normalized = {
        _message_id_normalize(message_id) for message_id in (excluded_message_ids or set()) if message_id
    }
start = time.perf_counter_ns()
email_to_uid = await util.email_to_uid_map()
end = time.perf_counter_ns()
log.info(f"LDAP search took {(end - start) / 1000000} ms")
log.info(f"Email addresses from LDAP: {len(email_to_uid)}")
```
#### `atr/cache.py` — `AdminsCache` (lines 39-59)
_extension point_
Existing file-based caching pattern (AdminsCache) that can be replicated for
the email-to-UID cache, with similar read-from-file and refresh loop mechanics.
```python
class AdminsCache(schema.Strict):
    refreshed: datetime.datetime = schema.description("When the cache was last refreshed")
    admins: frozenset[str] = schema.description("Set of admin user IDs from LDAP")


def admins_get() -> frozenset[str]:
    if asfquart.APP is not None:
        return asfquart.APP.extensions.get("admins", frozenset())
    cache_data = _admins_read_from_file()
    if cache_data is None:
        return frozenset()
    return cache_data.admins


async def admins_get_async() -> frozenset[str]:
    if asfquart.APP is not None:
        return asfquart.APP.extensions.get("admins", frozenset())
    cache_data = await _admins_read_from_file_async()
    if cache_data is None:
        return frozenset()
    return cache_data.admins
```
#### `atr/cache.py` — `_admins_path` (lines 131-132)
_extension point_
Shows the existing pattern for cache file location; the email cache would
use a similar path but in `secrets/cached` as discussed.
```python
def _admins_path() -> pathlib.Path:
return pathlib.Path(config.get().STATE_DIR) / "cache" / "admins.json"
```
#### `atr/cache.py` — `admins_refresh_loop` (lines 62-71)
_extension point_
Existing refresh loop pattern; per @alitheg, the worker process would
similarly produce the email cache on disk periodically.
```python
async def admins_refresh_loop() -> None:
while True:
await asyncio.sleep(ADMINS_POLL_INTERVAL_SECONDS)
try:
users = await ldap.fetch_admin_users()
await admins_save_to_file(users)
_admins_update_app_extensions(users)
log.info(f"Admin users cache refreshed: {len(users)} users")
except Exception as e:
log.warning(f"Admin refresh failed: {e}")
```
#### `atr/ldap.py` — `_search_core` (lines 379-398)
_currently does this_
LDAP search infrastructure used by email_to_uid_map; expensive to call
repeatedly for all users.
```python
def _search_core(params: SearchParameters) -> None:
params.results_list = []
params.err_msg = None
params.srv_info = None
params.detail_err = None
params.connection = None
    server = ldap3.Server(LDAP_SERVER_HOST, use_ssl=True, tls=_tls_config, get_info=ldap3.ALL)
params.srv_info = repr(server)
if params.bind_dn_from_config and params.bind_password_from_config:
params.connection = ldap3.Connection(
server,
user=params.bind_dn_from_config,
password=params.bind_password_from_config,
auto_bind=True,
check_names=False,
)
else:
        params.connection = ldap3.Connection(server, auto_bind=True, check_names=False)
```
### Where new code would go
- `atr/cache.py` — after the symbol `_project_version_update_app_extensions`:
  add the email-UID cache functions following the existing pattern of
  AdminsCache and the project_version cache, including the model, path,
  read, write, refresh loop, and startup functions.
- `atr/server.py` — after the startup hooks for the other caches:
  register the email cache refresh loop as a background task, and add a
  shutdown hook to erase the cache file.
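Since `atr/server.py` was not examined, the exact registration hooks are unknown. As a plain-asyncio sketch of the intended lifecycle (the names and the temp path are hypothetical stand-ins), the refresh loop runs as a cancellable background task, and the shutdown path cancels it and erases the file:

```python
import asyncio
import contextlib
import pathlib
import tempfile

# Hypothetical stand-in for the secrets/cached/email_uid.json path
CACHE_PATH = pathlib.Path(tempfile.gettempdir()) / "email_uid_demo_cache.json"

async def refresh_loop(interval: float) -> None:
    """Periodically rewrite the cache file (placeholder for email_uid_refresh())."""
    while True:
        CACHE_PATH.write_text("{}")
        await asyncio.sleep(interval)

async def main() -> None:
    # Startup: run the refresh loop as a background task
    task = asyncio.create_task(refresh_loop(0.01))
    await asyncio.sleep(0.05)  # ... serve requests ...
    # Shutdown: cancel the loop, then erase the on-disk cache
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    CACHE_PATH.unlink(missing_ok=True)

asyncio.run(main())
print(CACHE_PATH.exists())  # False
```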
### Proposed approach
Following the team's agreed design (@sbp and @alitheg): (1) Add an on-disk
cache of hashed-email-to-UID mappings in a `secrets/cached` directory under
STATE_DIR. The cache stores a JSON file mapping SHA-256 hashes of lowercase
email addresses to ASF UIDs, refreshed periodically by a background worker
task. (2) Add lookup helpers, `email_uid_lookup()` and `email_uid_get_all()`,
in `atr/cache.py` that read from the file on disk (no in-process memory is
needed since the main process and workers are separate). (3) Modify
`atr/tabulate.py::votes()` to try the cached
mapping first, falling back to a live LDAP query if the cache is missing or
stale. (4) Register a refresh loop in the worker/manager and add a shutdown
hook to erase the cache. This depends on #872 for cache invalidation
notifications. The hashing of emails is a privacy measure since the cache is on
disk.
### Suggested patches
#### `atr/cache.py`
Add email-to-UID cache with hashed emails, file-based persistence, refresh
loop, and cleanup
````diff
--- a/atr/cache.py
+++ b/atr/cache.py
@@ -1,5 +1,6 @@
import asyncio
import datetime
+import hashlib
import pathlib
from typing import Final
@@ -35,6 +36,9 @@
# Fifth prime after 3600
ADMINS_POLL_INTERVAL_SECONDS: Final[int] = 3631
+# Refresh interval for email-to-UID cache
+EMAIL_UID_POLL_INTERVAL_SECONDS: Final[int] = 3617
+
PROJECT_VERSION_POLL_INTERVAL_SECONDS: Final[int] = 307
@@ -42,6 +46,12 @@
     admins: frozenset[str] = schema.description("Set of admin user IDs from LDAP")
+
+
+class EmailUidCache(schema.Strict):
+    refreshed: datetime.datetime = schema.description("When the cache was last refreshed")
+    # Mapping of SHA-256 hashed (lowercase) email to ASF UID
+    mapping: dict[str, str] = schema.description("Hashed email to ASF UID mapping")
+
+
def admins_get() -> frozenset[str]:
if asfquart.APP is not None:
return asfquart.APP.extensions.get("admins", frozenset())
@@ -49,6 +59,72 @@
return cache_data.admins
+def email_uid_hash(email: str) -> str:
+ """Hash an email address for use as a cache key."""
+ return hashlib.sha256(email.lower().encode()).hexdigest()
+
+
+def email_uid_lookup(email: str) -> str | None:
+ """Look up an ASF UID by email address from the on-disk cache."""
+ cache_data = _email_uid_read_from_file()
+ if cache_data is None:
+ return None
+ hashed = email_uid_hash(email)
+ return cache_data.mapping.get(hashed)
+
+
+def email_uid_get_all() -> dict[str, str] | None:
+    """Get the full hashed-email-to-UID mapping, or None if the cache is unavailable.
+
+    Note: keys are SHA-256 hashes of lowercase emails, not plain addresses.
+    Callers that need to look up by plain email should use email_uid_lookup(),
+    which hashes the address before the lookup.
+    """
+    cache_data = _email_uid_read_from_file()
+    if cache_data is None:
+        return None
+    return cache_data.mapping
+
+
+async def email_uid_refresh() -> None:
+ """Fetch all emails from LDAP and save hashed mapping to disk."""
+ import atr.util as util
+
+ # util.email_to_uid_map() returns dict[str, str] of email -> uid
+ email_to_uid = await util.email_to_uid_map()
+ hashed_mapping: dict[str, str] = {}
+ for email, uid in email_to_uid.items():
+ hashed_mapping[email_uid_hash(email)] = uid
+ await email_uid_save_to_file(hashed_mapping)
+ log.info(f"Email-to-UID cache refreshed: {len(hashed_mapping)} entries")
+
+
+async def email_uid_refresh_loop() -> None:
+ while True:
+ await asyncio.sleep(EMAIL_UID_POLL_INTERVAL_SECONDS)
+ try:
+ await email_uid_refresh()
+ except Exception as e:
+ log.warning(f"Email-to-UID cache refresh failed: {e}")
+
+
+async def email_uid_save_to_file(mapping: dict[str, str]) -> None:
+ cache_path = _email_uid_path()
+ cache_path.parent.mkdir(parents=True, exist_ok=True)
+    cache_data = EmailUidCache(refreshed=datetime.datetime.now(datetime.UTC), mapping=mapping)
+ async with aiofiles.open(cache_path, "w") as f:
+ await f.write(cache_data.model_dump_json())
+
+
+async def email_uid_startup_load() -> None:
+ cache_data = _email_uid_read_from_file()
+ if cache_data is not None:
+        log.info(f"Loaded email-to-UID cache: {len(cache_data.mapping)} entries (refreshed: {cache_data.refreshed})")
+ return
+ log.info("No email-to-UID cache found, fetching from LDAP")
+ try:
+ await email_uid_refresh()
+ except Exception as e:
+        log.warning(f"Failed to populate email-to-UID cache at startup: {e}")
+
+
+def email_uid_erase() -> None:
+ """Erase the email-to-UID cache file. Called on shutdown."""
+ cache_path = _email_uid_path()
+ try:
+ cache_path.unlink(missing_ok=True)
+ log.info("Email-to-UID cache erased on shutdown")
+ except OSError as e:
+ log.warning(f"Failed to erase email-to-UID cache: {e}")
+
+
async def admins_get_async() -> frozenset[str]:
if asfquart.APP is not None:
return asfquart.APP.extensions.get("admins", frozenset())
@@ -118,6 +194,28 @@
def _admins_path() -> pathlib.Path:
return pathlib.Path(config.get().STATE_DIR) / "cache" / "admins.json"
+
+def _email_uid_path() -> pathlib.Path:
+    return pathlib.Path(config.get().STATE_DIR) / "secrets" / "cached" / "email_uid.json"
+
+
+def _email_uid_read_from_file() -> EmailUidCache | None:
+ cache_path = _email_uid_path()
+ if not cache_path.exists():
+ return None
+ try:
+ with open(cache_path) as f:
+ raw_data = f.read()
+ except OSError as e:
+ log.warning(f"Failed to read email-to-UID cache: {e}")
+ return None
+ try:
+ return EmailUidCache.model_validate_json(raw_data)
+ except pydantic.ValidationError as e:
+ log.warning(f"Failed to parse email-to-UID cache: {e}")
+ return None
+
+
def _admins_read_from_file() -> AdminsCache | None:
cache_path = _admins_path()
if not cache_path.exists():
````
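One refinement to the `email_uid_save_to_file` sketch above: since the main process reads the file that the worker writes, the write should be atomic so a reader never sees a half-written JSON document. The usual pattern is write-to-temp-then-rename (`os.replace` is atomic on POSIX). A stdlib sketch, not existing atr code; the path and payload shape are illustrative:

```python
import json
import os
import pathlib
import tempfile

def write_cache_atomically(path: pathlib.Path, mapping: dict[str, str]) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    os.replace() is atomic on POSIX, so a concurrent reader sees either the
    old complete file or the new complete file, never a partial write.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"mapping": mapping}, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

target = pathlib.Path(tempfile.gettempdir()) / "email_uid_demo.json"
write_cache_atomically(target, {"ab12": "dev"})
print(json.loads(target.read_text())["mapping"])  # {'ab12': 'dev'}
```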
#### `atr/tabulate.py`
Use cached email-to-UID mapping with fallback to live LDAP query
````diff
--- a/atr/tabulate.py
+++ b/atr/tabulate.py
@@ -22,6 +22,7 @@
from collections.abc import Generator
from typing import Protocol
+import atr.cache as cache
import atr.config as config
import atr.db as db
import atr.log as log
@@ -180,9 +181,17 @@
         _message_id_normalize(message_id) for message_id in (excluded_message_ids or set()) if message_id
}
start = time.perf_counter_ns()
-    email_to_uid = await util.email_to_uid_map()
+    # Try the on-disk cache first; fall back to a live LDAP query.
+    # Either way the resulting keys are SHA-256 hashes of lowercase emails,
+    # so _vote_identity() must hash its lookup key (see the open questions).
+    cached_mapping = cache.email_uid_get_all()
+    if cached_mapping is not None:
+        email_to_uid = cached_mapping
+    else:
+        live_map = await util.email_to_uid_map()
+        email_to_uid = {cache.email_uid_hash(email): uid for email, uid in live_map.items()}
end = time.perf_counter_ns()
log.info(f"LDAP search took {(end - start) / 1000000} ms")
log.info(f"Email addresses from LDAP: {len(email_to_uid)}")
````
### Open questions
- The hashing requirement creates a tension: the tabulate `_vote_identity()`
function needs to look up by plain email to find the UID, but the cache stores
hashed emails. Either the cache needs to store plain emails (less private but
simpler), or `_vote_identity()` needs to be refactored to hash the email before
lookup. The team should clarify whether the hashing is for at-rest privacy or
if a plain-email cache is acceptable.
- I don't have access to `atr/util.py` to see the implementation of
`email_to_uid_map()` — the diff for `atr/tabulate.py` depends on understanding
how the lookup works and whether the interface can be adapted to a hashed cache.
- Issue #872 is listed as a dependency for cache invalidation notifications
— is that issue resolved? The refresh loop approach works without it, but
real-time invalidation would require it.
- Where exactly should the shutdown hook (calling `email_uid_erase()`) be
registered? Need to see `atr/server.py` shutdown handling.
- Should the worker process (separate from the main web process) run the
refresh loop, or should the main process do it? @alitheg suggested the worker
produces it and the main process consumes it, which aligns with the file-based
approach but requires the worker to have its own refresh scheduling.
### Files examined
- `atr/cache.py`
- `atr/ldap.py`
- `atr/principal.py`
- `atr/tabulate.py`
- `tests/unit/test_ldap.py`
- `atr/get/resolve.py`
- `atr/storage/writers/vote.py`
- `atr/tasks/vote.py`
### Related issues
This issue appears related to: #873.
_Both involve LDAP caching and session handling improvements, with #882
depending on #873's pubsub notification mechanism_
---
*Draft from a triage agent. A human reviewer should validate before merging
any change. The agent did not run tests or verify diffs apply.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]