GitHub user weiqingy edited a discussion: [Feature] Per-Event-Type Configurable Log Levels for Event Log
GitHub issue: https://github.com/apache/flink-agents/issues/541

## Motivation

The event log captures every event flowing through an agent for debugging, auditing, and observability. Today the only filtering mechanism is `EventFilter`, a binary accept/reject predicate. This makes it impossible to:

- Log some event types at full detail while keeping others concise.
- Reduce log volume in production without losing visibility into critical event types.
- Adjust verbosity for a single event type at job submission time without re-specifying an entire filter.

This design introduces **per-event-type configurable log levels** so that operators can independently control the verbosity of each event type.

## Log Levels

Three log levels, ordered from least to most verbose:

| Level | Behavior |
|---|---|
| `OFF` | Event is not logged at all. |
| `STANDARD` | Event is logged. Details may be omitted to keep logs concise (see [Truncation Strategy](#truncation-strategy-standard-level)). |
| `VERBOSE` | Event is logged with full detail. Nothing is omitted. |

The default level for all event types is `STANDARD` with truncation active (default `max-length` of 4096 characters). This means `STANDARD` and `VERBOSE` have distinct behaviors out of the box.

### Why Default to STANDARD with Active Truncation

| Approach | Out-of-the-box Behavior | Backward Compatible? | Semantic Clarity |
|---|---|---|---|
| **STANDARD + active truncation (chosen)** | Events logged with long fields truncated automatically. | No. Existing logs may be truncated after upgrade. | High. STANDARD and VERBOSE are immediately distinct. |
| VERBOSE (no truncation) | All events logged in full, identical to today. | Yes. Zero behavior change. | Medium. Users must opt-in to STANDARD and set max-length to see benefits. |
| STANDARD + max-length=0 | All events logged in full, identical to today. | Yes. Zero behavior change. | Low. STANDARD and VERBOSE are identical until max-length is modified. |

We chose active truncation because:

- **Semantic clarity**: `STANDARD` and `VERBOSE` mean different things from day one. No configuration required to see the distinction.
- **Simple opt-in path**: Operators who need full detail for specific event types simply set those types to `VERBOSE`. No need to understand or configure `max-length`.
- **Practical benefit by default**: AI agent events frequently contain very long LLM responses (10K+ characters) and tool outputs. Truncation at 4096 characters keeps logs usable for monitoring without excessive disk usage.
- **Backward compatibility trade-off**: Existing users will see truncated logs after upgrade. This is mitigated by setting `event-log.level: VERBOSE` to restore the previous behavior, or setting specific event types to `VERBOSE` for full detail where needed.

## Configuration

### Config Key Pattern

Per-event-type settings use the pattern:

```
event-log.<EVENT_TYPE>.<property>
```

The event type appears in the middle, and the property name appears at the tail. This structure groups all settings for a given event type together and allows future per-type properties (e.g., routing events to different logger destinations) without restructuring the key namespace. This follows standard hierarchical logger configuration conventions.

**Future extensibility example:**

```yaml
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level: VERBOSE
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.max-length: 8192  # future
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.logger: kafka  # future
```

### Event Type Names in Config Keys

Config keys use the **fully-qualified class name** of the event type. This avoids ambiguity when different packages define event classes with the same simple name.

```
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level=VERBOSE
event-log.org.apache.flink.agents.api.InputEvent.level=OFF
```

### Hierarchical Inheritance

Log level resolution follows **hierarchy inheritance**. The dot-separated event type name defines a natural hierarchy. When an event type has no exact config match, the level is inherited from the nearest configured ancestor. The root config key `event-log.level` serves as the global default; no special `default` keyword is needed.

**Resolution order** (most specific wins):

1. **Exact match**: `event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level`
2. **Parent package**: `event-log.org.apache.flink.agents.api.event.level`
3. **Grandparent package**: `event-log.org.apache.flink.agents.api.level`
4. ... _(walks up the hierarchy)_
5. **Root**: `event-log.level`
6. **Built-in default**: `STANDARD` (if `event-log.level` is not configured)

**Example**: Given these event types:

```
org.apache.flink.agents.api.InputEvent
org.apache.flink.agents.api.OutputEvent
org.apache.flink.agents.api.event.ChatRequestEvent
org.apache.flink.agents.api.event.ToolRequestEvent
```

And this config:

```yaml
event-log.level: STANDARD  # root default
event-log.org.apache.flink.agents.api.event.level: OFF  # package-level
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level: VERBOSE  # exact type
```

Resolution:

| Event Type | Resolved Level | Why |
|---|---|---|
| `...api.event.ChatRequestEvent` | `VERBOSE` | Exact match |
| `...api.event.ToolRequestEvent` | `OFF` | No exact match → inherits from `...api.event` |
| `...api.InputEvent` | `STANDARD` | No exact match, no `...api` key → inherits from root |
| `...api.OutputEvent` | `STANDARD` | Same as above |

### Complete Config Key Reference

| Config Key | Type | Default | Description |
|---|---|---|---|
| `event-log.level` | String (`OFF` / `STANDARD` / `VERBOSE`) | `STANDARD` | Root default log level for all event types. |
| `event-log.<EVENT_TYPE>.level` | String (`OFF` / `STANDARD` / `VERBOSE`) | _(inherits from parent in hierarchy)_ | Log level for a specific event type or package. |
| `event-log.standard.max-length` | Integer | `4096` | Maximum character length for serialized event content at `STANDARD` level. Only positive values enable truncation; `0` or negative values disable it. Has no effect at `VERBOSE` level. |

### Configuration Examples

**Config file:**

```yaml
# Root default: log everything at STANDARD
event-log.level: STANDARD

# Java events: use Java FQCN
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level: VERBOSE
event-log.org.apache.flink.agents.api.event.ChatResponseEvent.level: VERBOSE
event-log.org.apache.flink.agents.api.event.ContextRetrievalRequestEvent.level: OFF
event-log.org.apache.flink.agents.api.event.ContextRetrievalResponseEvent.level: OFF

# Python events: use Python module path (the event type string from PythonEvent)
event-log.flink_agents.api.events.event.OutputEvent.level: VERBOSE
event-log.my_module.MyCustomEvent.level: OFF

# Truncation is active by default (max-length: 4096).
# To increase the limit:
# event-log.standard.max-length: 8192
```

The config key uses whatever type string appears in the event log's `eventType` field. For Java events, that's the Java FQCN (e.g., `org.apache.flink.agents.api.event.ChatRequestEvent`). For Python events, that's the Python module path (e.g., `flink_agents.api.events.event.OutputEvent`). The hierarchy inheritance works the same way for both: it walks up the dot-separated segments.

**Known limitations of the current model:**

- **Same logical event requires two config keys**: Java `OutputEvent` and Python `OutputEvent` are the same concept, but they have different type strings (`org.apache.flink.agents.api.OutputEvent` vs `flink_agents.api.events.event.OutputEvent`). There is no single config key that covers both.
- **Package-level config doesn't cross languages**: `event-log.org.apache.flink.agents.api.event.level: OFF` silences all Java events in that package, but equivalent Python events are unaffected.
- **No common ancestor below root**: Java hierarchies start with `org.apache...`, Python with `flink_agents...`. The only shared ancestor is the root `event-log.level`, which is too broad for targeted control.

These limitations are acceptable for the initial release because most jobs today are either pure Java or pure Python. See [Migration to Language-Independent Events](#migration-to-language-independent-events) for how these limitations are resolved when events become language-independent.

**Override at job submission time:**

```bash
# A shared config.yaml defines defaults for all jobs.
# Override just one event type for debugging a specific job run.
# Other per-type levels from config.yaml are preserved because
# each type has its own independent config key.
flink run ... \
  -Devent-log.org.apache.flink.agents.api.event.ChatRequestEvent.level=VERBOSE
```

## Truncation Strategy (STANDARD Level)

At `STANDARD` level, events may be truncated to stay within the `event-log.standard.max-length` limit (default: 4096 characters). Truncation **never** applies at `VERBOSE` level.

Setting `event-log.standard.max-length` to `0` disables truncation, making `STANDARD` behave identically to `VERBOSE` (except for the metadata label).

### What Gets Truncated

Truncation targets the content-heavy parts of the serialized event:

1. **Long string fields**: String values exceeding an internal threshold are shortened and suffixed with `"... [truncated]"`. This most commonly affects LLM response text, tool call arguments, and tool response bodies.
2. **Large arrays/lists**: Arrays with more elements than an internal threshold are trimmed, with a trailing marker indicating how many elements were removed.
3. **Deep nesting**: Object structures nested beyond an internal depth threshold are replaced with a placeholder.

The specific thresholds for each strategy are implementation details that may be tuned over time. The semantic contract is: at `STANDARD` level, details might be omitted to keep logs concise.

### What Does NOT Get Truncated

Structural and identifying fields are always preserved in full:

- `eventType`, `id`, `attributes`, `timestamp`
- Top-level scalar fields (model name, request IDs, status flags)

### Truncation Guarantees and Limitations

- **Approximate, not exact**: The character limit is a best-effort cap. Actual serialized output may slightly exceed the configured limit due to JSON escaping and structural overhead. Strict enforcement would require double-serialization, which is not worth the cost for a logging feature.
- **Truncated content is not independently parseable**: A truncated string field may contain partial JSON or incomplete text. Consumers needing complete structured content from a specific event type should configure that type at `VERBOSE` level.

## Event Log Record Schema

This section describes the JSON schema of each record written to the event log file. Two new top-level fields (`logLevel`, `eventType`) are added. Users and downstream tools that parse event log files should be aware of these additions.

Records include top-level `logLevel` and `eventType` fields:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "logLevel": "VERBOSE",
  "eventType": "org.apache.flink.agents.api.event.ChatRequestEvent",
  "event": {
    "eventType": "org.apache.flink.agents.api.event.ChatRequestEvent",
    "id": "...",
    "attributes": {},
    "model": "gpt-4",
    "messages": [...]
  }
}
```

At `STANDARD` level with truncation applied:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "logLevel": "STANDARD",
  "eventType": "org.apache.flink.agents.api.event.ChatResponseEvent",
  "event": {
    "eventType": "org.apache.flink.agents.api.event.ChatResponseEvent",
    "id": "...",
    "attributes": {},
    "response": "The beginning of a very long LLM response... [truncated]"
  }
}
```

The `eventType` field is emitted at the top level (alongside `timestamp`) for convenient downstream filtering without needing to parse into the `event` object.

Old records without `logLevel` or top-level `eventType` are deserialized correctly, defaulting to `VERBOSE` (since they were written before log levels existed and contain full untruncated content).

## Interaction with EventFilter

The existing `EventFilter` mechanism continues to work. Log level and event filter compose with AND semantics: both must pass for an event to be logged.

| `EventFilter.accept()` | Log Level | Event logged? |
|---|---|---|
| `true` | `STANDARD` or `VERBOSE` | Yes |
| `true` | `OFF` | No |
| `false` | any | No |

The `EventFilter` is evaluated first. If the filter rejects, the level is not consulted. Any `EventFilter` configured today continues to work unchanged.

## Validation

On logger initialization, configured event type names are validated against known event classes. Unrecognized names produce a warning log:

```
WARN - Configured event log level for 'org.apache.flink.agents.api.event.ChatRequstEvent' but no matching event class was found. Check for typos in the config key.
```

This catches typos without failing the job. Custom event types not in the built-in registry trigger the warning but still function correctly at runtime.

## Observability

When truncation is active (`event-log.standard.max-length > 0`), a counter metric `eventLogTruncatedEvents` is incremented each time an event is truncated. This helps operators decide whether to increase the length limit or switch specific event types to `VERBOSE`.
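The following is a minimal Python sketch of how the pieces described so far could fit together: the filter check, hierarchical level resolution, whole-payload truncation, and the `eventLogTruncatedEvents` counter. It is an illustration under stated assumptions, not the flink-agents implementation. The names `LogLevel`, `resolve_level`, and `decide` are hypothetical, the config is modeled as a flat dict of strings, the filter as a plain callable, and truncation is simplified to cutting one serialized string rather than the per-field strategies the design describes.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class LogLevel(Enum):
    """Hypothetical enum mirroring the three levels in the design."""
    OFF = "OFF"
    STANDARD = "STANDARD"
    VERBOSE = "VERBOSE"


TRUNCATION_SUFFIX = "... [truncated]"


def resolve_level(event_type: str, config: Dict[str, str]) -> LogLevel:
    """Walk up the dot-separated type name; the most specific key wins."""
    segments = event_type.split(".")
    while segments:
        key = "event-log." + ".".join(segments) + ".level"
        if key in config:
            return LogLevel(config[key])
        segments.pop()  # drop the last segment and try the parent
    # Root key, then the built-in default.
    return LogLevel(config.get("event-log.level", "STANDARD"))


def decide(
    event_type: str,
    payload: str,
    event_filter: Callable[[str], bool],
    config: Dict[str, str],
    counters: Dict[str, int],
) -> Optional[Tuple[LogLevel, str]]:
    """Return (level, payload to write), or None if the event is not logged."""
    # The filter is evaluated first; on reject the level is never consulted.
    if not event_filter(event_type):
        return None
    level = resolve_level(event_type, config)
    if level is LogLevel.OFF:
        return None
    max_length = int(config.get("event-log.standard.max-length", "4096"))
    # Simplification (assumption): truncate the whole serialized payload.
    # The result exceeds max_length by the suffix, matching the
    # "approximate, not exact" guarantee.
    if level is LogLevel.STANDARD and max_length > 0 and len(payload) > max_length:
        payload = payload[:max_length] + TRUNCATION_SUFFIX
        counters["eventLogTruncatedEvents"] = (
            counters.get("eventLogTruncatedEvents", 0) + 1
        )
    return level, payload
```

With the example config from the Hierarchical Inheritance section, `resolve_level` returns `VERBOSE` for `ChatRequestEvent`, `OFF` for `ToolRequestEvent`, and `STANDARD` for `InputEvent`, matching the resolution table.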
## Backward Compatibility

- Default log level is `STANDARD` with `max-length=4096`. This is a **behavior change** from today: events at `STANDARD` level may be truncated. To restore previous behavior, set `event-log.level: VERBOSE` or `event-log.standard.max-length: 0`.
- JSON records without `logLevel` or top-level `eventType` fields deserialize correctly, defaulting to `VERBOSE` (old records contain full untruncated content).
- Existing `EventFilter` configurations continue to work unchanged.
- No existing config keys are renamed or removed.

## Migration to Language-Independent Events

_(from reviewer feedback, cc @wenjin272)_

There is ongoing discussion about changing events to language-independent JSON objects to simplify custom event definitions, especially for cross-language use cases where users currently need to define the same event type in both Java and Python.

### Current Model (This Design)

Config keys use the event's type string as-is: Java FQCNs for Java events, Python module paths for Python events.

```yaml
event-log.org.apache.flink.agents.api.event.ChatRequestEvent.level: VERBOSE  # Java
event-log.flink_agents.api.events.event.OutputEvent.level: VERBOSE  # Python
```

This has known limitations in mixed-language jobs (see [Configuration Examples](#configuration-examples)), but is acceptable for the initial release because most jobs today are either pure Java or pure Python.

### Future Model (Language-Independent Events)

If events become plain JSON with a user-chosen type string (e.g., `"ChatRequestEvent"`, `"OutputEvent"`), the config keys simplify and the cross-language limitations disappear:

```yaml
event-log.ChatRequestEvent.level: VERBOSE  # one key covers both Java and Python
event-log.OutputEvent.level: VERBOSE  # no language-specific namespace
```

### Migration Plan

When language-independent events are adopted:

1. **Event type strings change**: The `eventType` field in log records would change from FQCNs/module paths to plain type strings. Config keys follow automatically since they are based on the `eventType` value.
2. **Deprecation period**: During migration, the system recognizes both old FQCN-style keys and new plain-string keys. If both are configured for the same event, the new key takes precedence. A warning is logged for deprecated FQCN-style keys.
3. **Hierarchy inheritance adapts**: With plain type strings that may not contain dots, hierarchy inheritance becomes less relevant. The root `event-log.level` still serves as the global default. If the community adopts a naming convention with dots (e.g., `chat.request`, `tool.response`), hierarchy inheritance continues to work.

### Design Decision

This design targets the current model (Java FQCNs + Python module paths) for the initial implementation. The `event-log.<TYPE>.level` config key pattern and hierarchy inheritance mechanism are compatible with both the current and future models; only the type strings that users write in config files would change during migration.

GitHub link: https://github.com/apache/flink-agents/discussions/552
