[ 
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030634#comment-18030634
 ] 

Tim Allison commented on TIKA-4513:
-----------------------------------

[~lewismc] Thank you for undertaking this work. [~ndipiazza] has proposed some 
significant changes to tika-server, including TIKA-3082.

Would it make sense to hold a virtual meetup (e.g., a Google Meet) to discuss 
staging the work?

> Instrument tika-server
> ----------------------
>
>                 Key: TIKA-4513
>                 URL: https://issues.apache.org/jira/browse/TIKA-4513
>             Project: Tika
>          Issue Type: Improvement
>          Components: instrumentation, tika-server
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Currently, tika-server lacks standardized observability instrumentation, 
> relying on basic logging or custom metrics, which limits our ability to 
> diagnose performance bottlenecks, track request latencies, or correlate 
> failures across distributed deployments (such as those readily available via 
> tika-helm).
> This initiative will implement [OpenTelemetry Java 
> (OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in the Apache 
> Tika Server to enable comprehensive collection of traces, metrics, and logs. 
> This will improve system observability, allowing for better monitoring of 
> request processing, resource usage, and error rates in a production 
> environment.
> OpenTelemetry Java is stable across all major components (traces, metrics and 
> logs), as per the official documentation.
> What's also nice about OTEL is that it integrates with tools like Jaeger 
> (tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would 
> also facilitate rich visualizations via tools like Grafana.
> h4. Rationale
>  * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows 
> (e.g., from HTTP ingestion to parser execution), metrics will track 
> throughput and error rates, and structured logs will provide context for 
> debugging.
>  * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure 
> compatibility with evolving observability backends without vendor lock-in.
>  * {*}Low Overhead{*}: We can experiment with 
> [zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent],
>  which will initially minimize the code changes needed to develop a baseline. We can 
> build on this to better observe custom Tika logic (e.g., parser chaining).
>  * {*}Community Benefits{*}: Enhances Tika's appeal for microservices 
> architectures, where observability is critical.
> h4. Goals
>  * Instrument core Tika Server endpoints (e.g., 
> [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
>  
> [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
>  
> [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
>  to emit telemetry data.
>  * Support configurable exporters for traces/metrics/logs to common backends 
> (e.g., OTLP to a collector).
>  * Ensure instrumentation does not degrade performance (<5% overhead target) 
> and handles high-load scenarios gracefully.
>  * Document setup for users deploying Tika Server.
> h4. Acceptance Criteria
>  * Tika Server builds and runs with OpenTelemetry agent attached (e.g., via 
> -javaagent:opentelemetry-javaagent.jar).
>  * Sample requests generate traces with spans for key operations (e.g., 
> document parsing, MIME detection), verifiable in a tracing backend (like 
> Jaeger).
>  * Metrics expose at least: request count, latency histograms, error rates, 
> and resource usage (CPU/memory via JVM metrics).
>  * Logs are structured and correlated with traces (e.g., via trace/span IDs).
>  * Unit/integration tests cover instrumentation (e.g., assert span attributes 
> like http.method and content.type).
>  * Configuration options added to tika-server.properties for 
> enabling/disabling telemetry and setting exporter endpoints.
>  * Documentation updated in Tika wiki with setup guide, including Docker 
> integration, and subsequently Tika Helm.
>  * Performance benchmarks show <5% overhead under load (e.g., using JMeter or 
> k6).
>  * No regressions in existing Tika Server functionality.
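
The "<5% overhead" criterion above could be checked along the following lines. This is only a minimal sketch: it assumes latency samples have already been collected from a baseline run and an instrumented run (e.g., exported from JMeter or k6), and it compares medians, which is one possible choice among many. The class name, method names, and sample values are made up for illustration and are not part of Tika.

```java
import java.util.Arrays;

/** Sketch: compare latency samples from a baseline run and an instrumented run. */
final class OverheadCheck {

    /** Median of the samples; the input array is not modified. */
    static double median(double[] samplesMillis) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Relative overhead of the instrumented run, as a percentage of the baseline. */
    static double overheadPercent(double baselineMillis, double instrumentedMillis) {
        if (baselineMillis <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (instrumentedMillis - baselineMillis) / baselineMillis * 100.0;
    }

    /** True if the median overhead stays strictly below the given budget. */
    static boolean withinBudget(double[] baseline, double[] instrumented, double budgetPercent) {
        return overheadPercent(median(baseline), median(instrumented)) < budgetPercent;
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds, for illustration only.
        double[] baseline = {100, 102, 98, 101, 99};
        double[] instrumented = {103, 105, 101, 104, 102};
        double overhead = overheadPercent(median(baseline), median(instrumented));
        System.out.println("overhead percent: " + overhead);
        System.out.println("within 5% budget: " + withinBudget(baseline, instrumented, 5.0));
    }
}
```

A real benchmark would likely also look at tail latencies (p95/p99) and throughput, not just the median.
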
> h4. Tasks
> *1. Research and Setup*
>  # Review OpenTelemetry Java getting-started guide and instrumentation 
> registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty 
> HTTP server, Apache HttpClient).
>  # Set up a local dev environment with Tika Server, OpenTelemetry Java agent 
> (latest stable release), and a test collector (e.g., [Grafana 
> Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
>  # Prototype basic trace export for a sample /tika request.
> *2. Core Instrumentation*
>  # Enable auto-instrumentation for HTTP handling and core Tika parsers.
>  # Add manual spans for custom logic (e.g., in TikaResource for request 
> routing, Parser chain execution).
>  # Implement metrics using the Meter API (e.g., counters for processed 
> documents, gauges for active parsers).
>  # Bridge logs to OpenTelemetry (e.g., via 
> io.opentelemetry.instrumentation.logback-appender-otel).
> *3. Configuration and Exporters*
>  # Integrate environment variables or properties for exporter config (e.g., 
> OTLP endpoint, sampling rate).
>  # Support batching and sampling to handle scale.
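
As a rough illustration of the exporter and sampling configuration described above, the sketch below assembles a launch command that attaches the OpenTelemetry Java agent and passes its settings as system properties. The `otel.*` property names are the agent's standard autoconfiguration properties; the jar file names, service name, endpoint, and sampling ratio are placeholders chosen for this example, and the helper class itself is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: build a JVM command line that attaches the OpenTelemetry Java agent. */
final class AgentLaunchCommand {

    static List<String> build(String agentJar, String serverJar,
                              String otlpEndpoint, double sampleRatio) {
        if (sampleRatio < 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in [0, 1]");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-javaagent:" + agentJar);
        // Service name shown in the backend; "tika-server" is a placeholder.
        cmd.add("-Dotel.service.name=tika-server");
        // Where the OTLP exporter sends data (typically a collector).
        cmd.add("-Dotel.exporter.otlp.endpoint=" + otlpEndpoint);
        // Sample a fraction of new traces, honoring the parent's decision if present.
        cmd.add("-Dotel.traces.sampler=parentbased_traceidratio");
        cmd.add("-Dotel.traces.sampler.arg=" + sampleRatio);
        cmd.add("-jar");
        cmd.add(serverJar);
        return cmd;
    }

    public static void main(String[] args) {
        // All argument values below are illustrative assumptions.
        List<String> cmd = build("opentelemetry-javaagent.jar", "tika-server.jar",
                "http://localhost:4318", 0.1);
        System.out.println(String.join(" ", cmd));
    }
}
```

The same settings can be supplied as environment variables instead (e.g., `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_TRACES_SAMPLER`, `OTEL_TRACES_SAMPLER_ARG`), which tends to suit Docker and Helm deployments.
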
> *4. Testing and Validation*
>  # Write tests using OpenTelemetry SDK's in-memory exporter to assert 
> telemetry output.
>  # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
>  # Edge case testing: error handling, large files, concurrent requests.
> *5. Documentation and Release*
>  # Update TikaServer README and wiki with instrumentation guide.
>  # Submit PR and ensure existing CI/CD remains stable.
> h4. Risks and Dependencies
>  * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g., 
> SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation 
> without disruption.
>  * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via 
> Maven Central).
>  * {*}Risk{*}: Performance impact in parser-heavy workloads. 
> {_}Mitigation{_}: Profile with async spans and configurable sampling.
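
To make the log/trace correlation mentioned above more concrete: OpenTelemetry propagates context by default using the W3C Trace Context `traceparent` header, whose version-00 layout is `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`. The sketch below parses such a header and renders the IDs as log fields. It is a simplified illustration only (it does not reject all-zero IDs, for instance); in practice the OpenTelemetry log instrumentation would supply these IDs, and the class and field names here are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract trace/span IDs from a W3C traceparent header (version 00). */
final class TraceParent {

    private static final Pattern VERSION_00 =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = VERSION_00.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + header);
        }
        // The lowest bit of the trace-flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Key/value fields that a structured log line could carry for correlation. */
    String logFields() {
        return "trace_id=" + traceId + " span_id=" + spanId;
    }

    public static void main(String[] args) {
        // Example header value taken from the W3C Trace Context specification.
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.logFields() + " sampled=" + tp.sampled);
    }
}
```
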
> Two further points
>  # If anyone is interested and would like to work with me on this, please let 
> me know :)
>  # I'll likely create a sub-issue for each task, so that we can 
> incrementally prove and deliver this larger observability initiative.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
