[
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030734#comment-18030734
]
Nicholas DiPiazza commented on TIKA-4513:
-----------------------------------------
[~lewismc] [~tallison]
A meet up would be great.
I will reach out via email and schedule something.
> Instrument tika-server
> ----------------------
>
> Key: TIKA-4513
> URL: https://issues.apache.org/jira/browse/TIKA-4513
> Project: Tika
> Issue Type: Improvement
> Components: instrumentation, tika-server
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 4.0.0
>
>
> Currently, tika-server lacks standardized observability instrumentation,
> relying on basic logging or custom metrics, which limits our ability to
> diagnose performance bottlenecks, track request latencies, or correlate
> failures across distributed deployments (which are readily available via
> tika-helm).
> This initiative will implement
> [OpenTelemetry Java (OTEL)|https://opentelemetry.io/docs/languages/java/]
> instrumentation in the Apache Tika Server to enable comprehensive collection
> of traces, metrics, and logs.
> This will improve system observability, allowing for better monitoring of
> request processing, resource usage, and error rates in a production
> environment.
> OpenTelemetry Java is stable across all major components (traces, metrics,
> and logs), as per the official documentation.
> What's also nice about OTEL is that it integrates with tools like Jaeger
> (tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would
> also facilitate rich visualizations via tools like Grafana.
> h4. Rationale
> * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows
> (e.g., from HTTP ingestion to parser execution), metrics will track
> throughput and error rates, and structured logs will provide context for
> debugging.
> * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure
> compatibility with evolving observability backends without vendor lock-in.
> * {*}Low Overhead{*}: We can experiment with
> [zero-code auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent],
> which initially minimizes the code changes needed to establish a baseline.
> We can build on this to better observe custom Tika logic (e.g., parser
> chaining).
> * {*}Community Benefits{*}: Enhances Tika's appeal for microservices
> architectures, where observability is critical.
> h4. Goals
> * Instrument core Tika Server endpoints (e.g.,
> [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
> [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
> [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
> to emit telemetry data.
> * Support configurable exporters for traces/metrics/logs to common backends
> (e.g., OTLP to a collector).
> * Ensure instrumentation does not degrade performance (<5% overhead target)
> and handles high-load scenarios gracefully.
> * Document setup for users deploying Tika Server.
> h4. Acceptance Criteria
> * Tika Server builds and runs with OpenTelemetry agent attached (e.g., via
> -javaagent:opentelemetry-javaagent.jar).
> * Sample requests generate traces with spans for key operations (e.g.,
> document parsing, MIME detection); verifiable via a tracing backend (like
> Jaeger).
> * Metrics expose at least: request count, latency histograms, error rates,
> and resource usage (CPU/memory via JVM metrics).
> * Logs are structured and correlated with traces (e.g., via trace/span IDs).
> * Unit/integration tests cover instrumentation (e.g., assert span attributes
> like http.method and content.type).
> * Configuration options added to tika-server.properties for
> enabling/disabling telemetry and setting exporter endpoints.
> * Documentation updated in Tika wiki with setup guide, including Docker
> integration and, subsequently, Tika Helm.
> * Performance benchmarks show <5% overhead under load (e.g., using JMeter or
> k6).
> * No regressions in existing Tika Server functionality.
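On the <5% overhead criterion above: a minimal, dependency-free sketch of how benchmark results could be compared, assuming per-request latencies (in milliseconds) have already been collected from two load-test runs, one without and one with the agent attached. The class and method names are hypothetical, and using the median as the summary statistic is an assumption here, not something the issue specifies.

```java
import java.util.Arrays;

// Hypothetical helper for comparing two benchmark runs; not part of Tika or OpenTelemetry.
final class OverheadCheck {

    /** Median of the given latency samples (milliseconds). */
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Relative overhead of the instrumented run, e.g. 0.03 means 3% slower. */
    static double relativeOverhead(double[] baselineMs, double[] instrumentedMs) {
        double base = median(baselineMs);
        if (base <= 0.0) {
            throw new IllegalArgumentException("baseline median must be positive");
        }
        return (median(instrumentedMs) - base) / base;
    }

    /** True when the overhead stays strictly below the budget (0.05 for the 5% target). */
    static boolean withinBudget(double[] baselineMs, double[] instrumentedMs, double budget) {
        return relativeOverhead(baselineMs, instrumentedMs) < budget;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        double[] baseline = {100.0, 102.0, 98.0, 101.0, 99.0};
        double[] instrumented = {103.0, 104.0, 102.0, 105.0, 101.0};
        System.out.println("relative overhead: " + relativeOverhead(baseline, instrumented));
        System.out.println("within 5% budget: " + withinBudget(baseline, instrumented, 0.05));
    }
}
```

In practice the latency arrays would come from the JMeter or k6 result files, and a tail percentile (p95/p99) may be a better fit than the median for high-load scenarios.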
> h4. Tasks
> *1. Research and Setup*
> # Review OpenTelemetry Java getting-started guide and instrumentation
> registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty
> HTTP server, Apache HttpClient).
> # Set up a local dev environment with Tika Server, OpenTelemetry Java agent
> (latest stable release), and a test collector (e.g., [Grafana
> Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
> # Prototype basic trace export for a sample /tika request.
> *2. Core Instrumentation*
> # Enable auto-instrumentation for HTTP handling and core Tika parsers.
> # Add manual spans for custom logic (e.g., in TikaResource for request
> routing, Parser chain execution).
> # Implement metrics using the Meter API (e.g., counters for processed
> documents, gauges for active parsers).
> # Bridge logs to OpenTelemetry (e.g., via
> io.opentelemetry.instrumentation.logback-appender-otel).
> *3. Configuration and Exporters*
> # Integrate environment variables or properties for exporter config (e.g.,
> OTLP endpoint, sampling rate).
> # Support batching and sampling to handle scale.
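For the exporter configuration task above, a minimal sketch of resolving settings from environment variables. `OTEL_SDK_DISABLED`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `OTEL_TRACES_SAMPLER_ARG` are variable names defined by the OpenTelemetry specification; the `TelemetryConfig` class, its fallback values, and its validation rules are hypothetical and only illustrate the idea. With the Java agent attached, the agent reads these variables itself, so a class like this would only be relevant if tika-server wanted to validate or report the effective settings.

```java
import java.util.Map;

// Hypothetical holder for telemetry settings; not an existing Tika class.
final class TelemetryConfig {
    // Illustrative fallbacks (assumptions), used when a variable is unset.
    static final String FALLBACK_ENDPOINT = "http://localhost:4317";
    static final double FALLBACK_RATIO = 1.0;

    final boolean enabled;
    final String otlpEndpoint;
    final double samplingRatio;

    private TelemetryConfig(boolean enabled, String otlpEndpoint, double samplingRatio) {
        this.enabled = enabled;
        this.otlpEndpoint = otlpEndpoint;
        this.samplingRatio = samplingRatio;
    }

    /** Builds a config from an environment-style map, e.g. System.getenv(). */
    static TelemetryConfig fromEnv(Map<String, String> env) {
        boolean disabled = Boolean.parseBoolean(env.getOrDefault("OTEL_SDK_DISABLED", "false"));
        String endpoint = env.getOrDefault("OTEL_EXPORTER_OTLP_ENDPOINT", FALLBACK_ENDPOINT);
        String rawRatio = env.get("OTEL_TRACES_SAMPLER_ARG");
        double ratio = (rawRatio == null) ? FALLBACK_RATIO : parseRatio(rawRatio);
        return new TelemetryConfig(!disabled, endpoint, ratio);
    }

    /** Parses a sampling ratio and rejects anything outside [0.0, 1.0]. */
    private static double parseRatio(String raw) {
        double ratio;
        try {
            ratio = Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("sampling ratio is not a number: " + raw, e);
        }
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("sampling ratio must be in [0.0, 1.0]: " + raw);
        }
        return ratio;
    }

    public static void main(String[] args) {
        TelemetryConfig config = fromEnv(System.getenv());
        System.out.println("enabled=" + config.enabled
                + " endpoint=" + config.otlpEndpoint
                + " samplingRatio=" + config.samplingRatio);
    }
}
```

Taking a `Map` rather than calling `System.getenv()` directly keeps the parsing testable without touching the real process environment.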
> *4. Testing and Validation*
> # Write tests using OpenTelemetry SDK's in-memory exporter to assert
> telemetry output.
> # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
> # Edge case testing: error handling, large files, concurrent requests.
> *5. Documentation and Release*
> # Update TikaServer README and wiki with instrumentation guide.
> # Submit PR and ensure existing CI/CD remains stable.
> h4. Risks and Dependencies
> * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g.,
> SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation
> without disruption.
> * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via
> Maven Central).
> * {*}Risk{*}: Performance impact in parser-heavy workloads.
> {_}Mitigation{_}: Profile with async spans and configurable sampling.
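On configurable sampling as a mitigation: a simplified sketch of ratio-based sampling, where the keep/drop decision is derived deterministically from the trace ID so that the same trace always gets the same decision. This is a stand-in to illustrate the idea, not the OpenTelemetry SDK's sampler; the `RatioSampler` class is hypothetical, and passing the trace ID as a single `long` (its low 64 bits) is a simplifying assumption.

```java
import java.util.Random;

// Hypothetical, simplified ratio sampler; not the OpenTelemetry SDK implementation.
final class RatioSampler {
    private final boolean always;
    private final long upperBound;

    RatioSampler(double ratio) {
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0.0, 1.0]: " + ratio);
        }
        this.always = ratio == 1.0;
        // Scale the ratio onto the non-negative long range [0, Long.MAX_VALUE].
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** Returns true if the trace with these low 64 trace-ID bits should be kept. */
    boolean shouldSample(long traceIdLowBits) {
        if (always) {
            return true;
        }
        long value = traceIdLowBits & Long.MAX_VALUE; // clear the sign bit
        return value < upperBound;
    }

    public static void main(String[] args) {
        RatioSampler sampler = new RatioSampler(0.1);
        Random random = new Random(42); // stand-in for randomly generated trace IDs
        int total = 100_000;
        int kept = 0;
        for (int i = 0; i < total; i++) {
            if (sampler.shouldSample(random.nextLong())) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of " + total + " traces");
    }
}
```

Because the decision depends only on the trace ID, lowering the ratio under parser-heavy load reduces span volume without splitting a trace between sampled and unsampled parts.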
> Two further points
> # If anyone is interested and would like to work with me on this, please let
> me know :)
> # I'll likely create a sub-issue for each task, so that we can
> incrementally prove and deliver this larger observability initiative.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)