Copilot commented on code in PR #2870:
URL: https://github.com/apache/tika/pull/2870#discussion_r3363745267

##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:

Review Comment:
   This section intro frames JSON error bodies as only applying to process-level
   failures (timeout/OOM/crash), but the table below (and the server behavior) also
   covers task/initialization failures that return `500` with the same JSON body.
   Widening the wording here will prevent readers from assuming non-process failures
   still return plain text.

##########
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/PipesParsingHelper.java:
##########
@@ -184,33 +187,56 @@ private String getSuffix(Metadata metadata) {
         return ".tmp";
     }
 
+    /**
+     * Builds a JSON error response carrying a subset of the {@code PipesResult}
+     * serialization — the {@code status} and, when present, a non-blank {@code message}:
+     * {@code {"status": "TIMEOUT", "message": "..."}}. Successful-parse fields such as
+     * {@code emitData} are never part of an error body.
+     * <p>
+     * This allows clients to distinguish failure modes (TIMEOUT, OOM, UNSPECIFIED_CRASH, …)
+     * without parsing plain-text bodies or inspecting custom headers.
+     */
+    private static Response buildProcessFailureResponse(PipesResult result) {
+        ObjectMapper mapper = new ObjectMapper();
+        ObjectNode node = mapper.createObjectNode();
+        node.put("status", result.status().name());
+        if (result.message() != null && !result.message().isBlank()) {
+            node.put("message", result.message());
+        }

Review Comment:
   `buildProcessFailureResponse` currently includes `result.message()` verbatim in the
   JSON response when non-blank. In practice, `PipesResult.message()` is often populated
   with `ExceptionUtils.getStackTrace(e)` (multi-line stack traces) for statuses like
   `FETCH_EXCEPTION`, `EMIT_EXCEPTION`, and `UNSPECIFIED_CRASH`, which can leak internal
   server details and can produce very large HTTP error bodies. Consider only including
   a short, single-line message (or omitting it entirely) to avoid exposing stack traces
   by default.

##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:
+
+[source,json]
+----
+{"status": "TIMEOUT", "message": "Task timed out after 60000ms"}
+----
+
+The `status` field is the `PipesResult.RESULT_STATUS` enum name. The `message` field is
+present when Tika provided one, absent otherwise.
+
+[cols="1,1,3"]
+|===
+|HTTP status |`status` values |Meaning
+
+|`503 Service Unavailable`
+|`TIMEOUT`, `OOM`, `UNSPECIFIED_CRASH`, `CLIENT_UNAVAILABLE_WITHIN_MS`
+|The forked parse process failed, or no parse client became available within the
+configured wait time (`CLIENT_UNAVAILABLE_WITHIN_MS`). The server is still healthy;
+the client may retry.
+
+|`500 Internal Server Error`
+|`FAILED_TO_INITIALIZE`, `FETCH_EXCEPTION`, `EMIT_EXCEPTION`,
+`FETCHER_NOT_FOUND`, `EMITTER_NOT_FOUND`,
+`FETCHER_INITIALIZATION_EXCEPTION`, `EMITTER_INITIALIZATION_EXCEPTION`
+|Server misconfiguration or a task-level infrastructure error. Retrying the same
+document on the same server is unlikely to succeed without a configuration fix.
+|===
+
+NOTE: A successful parse that encountered internal parser errors (e.g. a truncated
+embedded document) still returns `200 OK`. The partial-parse exception is surfaced
+in the `X-TIKA:CONTAINER_EXCEPTION` metadata field of the response, not as an HTTP
+error code.

Review Comment:
   The docs reference `X-TIKA:CONTAINER_EXCEPTION`, but the actual metadata key used by
   `TikaCoreProperties.CONTAINER_EXCEPTION` is `X-TIKA:EXCEPTION:container_exception`.
   Using the wrong field name will mislead clients looking for partial-parse exceptions.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
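
A minimal sketch of the suggestion in the review comment on `buildProcessFailureResponse`: reduce the message to its first non-blank line and cap its length, so a multi-line stack trace never reaches the JSON body. The class name `ErrorMessages`, the method `summarize`, and the 200-character cap are hypothetical and not part of the PR or of Tika; only standard Java library calls are used.

```java
// Hypothetical helper, not part of Tika: shortens a possibly multi-line
// message (for example a stack trace) to a single capped line.
final class ErrorMessages {

    // Assumed cap for illustration; the PR does not specify a limit.
    public static final int MAX_LENGTH = 200;

    private ErrorMessages() {
    }

    /**
     * Returns the first non-blank line of {@code message}, trimmed and cut to
     * at most {@code maxLength} characters, or {@code null} when the message
     * is null or contains only whitespace.
     */
    public static String summarize(String message, int maxLength) {
        if (message == null) {
            return null;
        }
        // \R matches any line break sequence (\n, \r\n, \r, and others).
        for (String line : message.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed.length() <= maxLength
                        ? trimmed
                        : trimmed.substring(0, maxLength);
            }
        }
        return null;
    }
}
```

Under these assumptions, the response builder would call `ErrorMessages.summarize(result.message(), ErrorMessages.MAX_LENGTH)` and add the `message` field only when the result is non-null. For a stack trace, the first line is typically the exception class and its message, which may still be more detail than a deployment wants to expose; omitting the field entirely, as the comment also suggests, remains the more conservative option.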
