[
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030535#comment-18030535
]
ASF GitHub Bot commented on TIKA-4513:
--------------------------------------
lewismc opened a new pull request, #2367:
URL: https://github.com/apache/tika/pull/2367
This covers task #1 (Research and Setup) from
[TIKA-4513](https://issues.apache.org/jira/browse/TIKA-4513), i.e.
> 1. Research and Setup
>
> Review OpenTelemetry Java getting-started guide and instrumentation
registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty HTTP
server, Apache HttpClient).
> Set up a local dev environment with Tika Server, OpenTelemetry Java agent
(latest stable release), and a test collector (e.g., [Grafana
Alloy](https://grafana.com/docs/alloy/latest/) in Docker).
> Prototype basic trace export for a sample /tika request.
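
One way to produce the sample `/tika` request mentioned in the quoted task is a tiny Java client. This is a minimal sketch, assuming tika-server is listening locally on its default port 9998. The class name `SampleTikaRequest` is hypothetical and is not part of this PR. With the OpenTelemetry Java agent attached to the server, each such request should appear as a trace in whichever collector is configured.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SampleTikaRequest {

    // Assumption: tika-server runs locally on its default port.
    static final String DEFAULT_BASE_URL = "http://localhost:9998";

    /** Builds a PUT request that sends a small plain-text document to /tika. */
    static HttpRequest buildRequest(String baseUrl, String documentText) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/tika"))
                .header("Content-Type", "text/plain")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(documentText, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Optionally pass a different base URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : DEFAULT_BASE_URL;
        HttpRequest request = buildRequest(baseUrl, "Hello, OpenTelemetry!");

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Print the HTTP status and the text extracted by tika-server.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Running `main` a few times against an instrumented server is enough to generate traces for inspection.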
I have lots of commentary to add... which I will do in due course. For now I
was thinking of creating a video demo to better communicate the PR and what it
offers.
One important point: instrumentation (per OTEL) is disabled by default, so
the impact on existing Tika users is very small.
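
To illustrate why an opt-in default keeps the impact small, here is a minimal sketch of the general pattern. It is not the code in this PR: the property name `telemetry.enabled` and the class `OptInTelemetry` are hypothetical, and a simple timing printout stands in for a real OpenTelemetry span. When the flag is absent or false, the work runs as a plain call-through.

```java
import java.util.Properties;
import java.util.function.Supplier;

public class OptInTelemetry {

    // Hypothetical property name, used only for this illustration.
    static final String ENABLED_KEY = "telemetry.enabled";

    /** Telemetry is off unless the property is explicitly set to "true". */
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(ENABLED_KEY, "false"));
    }

    /** Runs the work, timing it only when telemetry is enabled. */
    static <T> T traced(Properties props, String spanName, Supplier<T> work) {
        if (!isEnabled(props)) {
            // Disabled (the default): no timing, no extra output.
            return work.get();
        }
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Stand-in for ending a span and recording its duration.
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(spanName + " took " + micros + " us");
        }
    }
}
```

With an empty `Properties` object, `traced` returns the supplier's result and emits nothing, which is the behaviour existing users would see.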
Before I get around to asking people to review this PR, I want to agree on
how to structure the constituent tasks in TIKA-4513. I will continue that
conversation on the Jira ticket.
In the meantime if anyone wishes to take this for a spin the markdown
documentation (most notably `OPENTELEMETRY.md`) will get you up and running.
**NOTE**: I used `Claude-4.5-sonnet` to generate:
- the markdown documents. I will note that Claude generated lots of mistakes,
which I fixed by hand during my peer review. That being said, I've literally
stepped through this documentation line-by-line now and I genuinely don't think
I could have done it better myself if you gave me another week. I'm impressed
and satisfied with the in-progress result.
- some Javadoc, notably the Javadocs with loads of commentary. Again, I'm
satisfied with the outcome and I think it will assist in a better understanding
of the additions.
- `TikaOpenTelemetryTest.java`... some basic unit test coverage which was
convenient.
- to figure out that `TikaOpenTelemetryConfig` had to implement
`Initializable`... this saved me loads of study time as it had been ages since I
looked at tika-server internals and lots has changed.
This instrumentation mega-project is likely similar in scale to tika-pipes.
There is still loads of work to do.
You will also have noticed that I used
[Jaeger](https://www.jaegertracing.io/) as a basic example. I will be providing
another example using [Grafana Alloy as the OTEL
collector](https://github.com/grafana/alloy), as it is much more closely aligned
with $dayjob. That being said, I did want to demonstrate the power of OTEL as
a vendor-agnostic instrumentation framework. Very powerful indeed.
In the meantime, here are a few screenshots that demonstrate what a trace
containing two spans looks like in Jaeger. Pretty basic but exciting stuff.
<img width="1710" height="1112" alt="Screenshot 2025-10-16 at 22 27 23"
src="https://github.com/user-attachments/assets/d6a81991-6ccd-4d54-b743-a8cfc29a7286"
/>
<img width="1710" height="1112" alt="Screenshot 2025-10-16 at 22 27 47"
src="https://github.com/user-attachments/assets/a0e06925-9086-4d87-8dc8-1ab60a187aeb"
/>
> Instrument tika-server
> ----------------------
>
> Key: TIKA-4513
> URL: https://issues.apache.org/jira/browse/TIKA-4513
> Project: Tika
> Issue Type: Improvement
> Components: instrumentation, tika-server
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
>
> Currently, tika-server lacks standardized observability instrumentation,
> relying on basic logging or custom metrics, which limits our ability to
> diagnose performance bottlenecks, track request latencies, or correlate
> failures across distributed deployments (which is readily available via
> tika-helm).
> This initiative will implement [OpenTelemetry Java
> (OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in the Apache
> Tika Server to enable comprehensive collection of traces, metrics, and logs.
> This will improve system observability, allowing for better monitoring of
> request processing, resource usage, and error rates in a production
> environment.
> OpenTelemetry Java is stable across all major components (traces, metrics and
> logs), as per the official documentation.
> What's also nice about OTEL is that it integrates with tools like Jaeger
> (tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would
> also facilitate rich visualizations via tools like Grafana.
> h4. Rationale
> * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows
> (e.g., from HTTP ingestion to parser execution), metrics will track
> throughput and error rates, and structured logs will provide context for
> debugging.
> * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure
> compatibility with evolving observability backends without vendor lock-in.
> * {*}Low Overhead{*}: We can experiment with
> [zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent]
> which will initially minimize code changes to develop a baseline. We can
> build on this to better observe custom Tika logic (e.g., parser chaining).
> * {*}Community Benefits{*}: Enhances Tika's appeal for microservices
> architectures, where observability is critical.
> h4. Goals
> * Instrument core Tika Server endpoints (e.g.,
> [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
>
> [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
>
> [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
> to emit telemetry data.
> * Support configurable exporters for traces/metrics/logs to common backends
> (e.g., OTLP to a collector).
> * Ensure instrumentation does not degrade performance (<5% overhead target)
> and handles high-load scenarios gracefully.
> * Document setup for users deploying Tika Server.
> h4. Acceptance Criteria
> * Tika Server builds and runs with OpenTelemetry agent attached (e.g., via
> -javaagent:opentelemetry-javaagent.jar).
> * Sample requests generate traces with spans for key operations (e.g.,
> document parsing, MIME detection); verifiable via a tracer exporter (like
> Jaeger).
> * Metrics expose at least: request count, latency histograms, error rates,
> and resource usage (CPU/memory via JVM metrics).
> * Logs are structured and correlated with traces (e.g., via trace/span IDs).
> * Unit/integration tests cover instrumentation (e.g., assert span attributes
> like http.method and content.type).
> * Configuration options added to tika-server.properties for
> enabling/disabling telemetry and setting exporter endpoints.
> * Documentation updated in Tika wiki with setup guide, including Docker
> integration... and then Tika Helm.
> * Performance benchmarks show <5% overhead under load (e.g., using JMeter or
> k6).
> * No regressions in existing Tika Server functionality.
> h4. Tasks
> *1. Research and Setup*
> # Review OpenTelemetry Java getting-started guide and instrumentation
> registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty
> HTTP server, Apache HttpClient).
> # Set up a local dev environment with Tika Server, OpenTelemetry Java agent
> (latest stable release), and a test collector (e.g., [Grafana
> Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
> # Prototype basic trace export for a sample /tika request.
> *2. Core Instrumentation*
> # Enable auto-instrumentation for HTTP handling and core Tika parsers.
> # Add manual spans for custom logic (e.g., in TikaResource for request
> routing, Parser chain execution).
> # Implement metrics using the Meter API (e.g., counters for processed
> documents, gauges for active parsers).
> # Bridge logs to OpenTelemetry (e.g., via
> io.opentelemetry.instrumentation.logback-appender-otel).
> *3. Configuration and Exporters*
> # Integrate environment variables or properties for exporter config (e.g.,
> OTLP endpoint, sampling rate).
> # Support batching and sampling to handle scale.
> *4. Testing and Validation*
> # Write tests using OpenTelemetry SDK's in-memory exporter to assert
> telemetry output.
> # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
> # Edge case testing: error handling, large files, concurrent requests.
> *5. Documentation and Release*
> # Update TikaServer README and wiki with instrumentation guide.
> # Submit PR and ensure existing CI/CD remains stable.
> h4. Risks and Dependencies
> * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g.,
> SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation
> without disruption.
> * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via
> Maven Central).
> * {*}Risk{*}: Performance impact in parser-heavy workloads.
> {_}Mitigation{_}: Profile with async spans and configurable sampling.
> Two further points
> # If anyone is interested and would like to work with me on this, please let
> me know :)
> # I'll likely create a sub-issue for each task, that way we can
> incrementally prove and deliver this larger observability initiative.