[
https://issues.apache.org/jira/browse/TIKA-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086411#comment-18086411
]
ASF GitHub Bot commented on TIKA-4753:
--------------------------------------
Copilot commented on code in PR #2870:
URL: https://github.com/apache/tika/pull/2870#discussion_r3363745267
##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type # single field
* `/translate/all/\{translator}/\{src}/\{dest}` — translation
* `/pipes`, `/async` — Pipes-based bulk processing
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:
Review Comment:
This section intro frames JSON error bodies as only applying to
process-level failures (timeout/OOM/crash), but the table below (and the server
behavior) also covers task/initialization failures that return `500` with the
same JSON body. Widening the wording here will prevent readers from assuming
non-process failures still return plain text.
##########
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/PipesParsingHelper.java:
##########
@@ -184,33 +187,56 @@ private String getSuffix(Metadata metadata) {
return ".tmp";
}
+ /**
+ * Builds a JSON error response carrying a subset of the {@code PipesResult}
+ * serialization — the {@code status} and, when present, a non-blank {@code message}:
+ * {@code {"status": "TIMEOUT", "message": "..."}}. Successful-parse fields such as
+ * {@code emitData} are never part of an error body.
+ * <p>
+ * This allows clients to distinguish failure modes (TIMEOUT, OOM, UNSPECIFIED_CRASH, …)
+ * without parsing plain-text bodies or inspecting custom headers.
+ */
+ private static Response buildProcessFailureResponse(PipesResult result) {
+ ObjectMapper mapper = new ObjectMapper();
+ ObjectNode node = mapper.createObjectNode();
+ node.put("status", result.status().name());
+ if (result.message() != null && !result.message().isBlank()) {
+ node.put("message", result.message());
+ }
Review Comment:
`buildProcessFailureResponse` currently includes `result.message()` verbatim
in the JSON response when non-blank. In practice, `PipesResult.message()` is
often populated with `ExceptionUtils.getStackTrace(e)` (multi-line stack
traces) for statuses like `FETCH_EXCEPTION`, `EMIT_EXCEPTION`, and
`UNSPECIFIED_CRASH`, which can leak internal server details and can produce
very large HTTP error bodies. Consider only including a short, single-line
message (or omitting it entirely) to avoid exposing stack traces by default.
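One way to act on this comment, sketched under assumptions: reduce the message to its first non-blank line and cap its length before it goes into the JSON body. The class name `SafeMessages`, the method `firstLine`, and the 200-character cap are hypothetical and not part of Tika; the sketch uses only the Java 11 standard library.

```java
public final class SafeMessages {
    // Assumed cap on the message length; the right value is a project decision.
    private static final int MAX_LENGTH = 200;

    private SafeMessages() {
    }

    /**
     * Returns the first non-blank line of {@code raw}, stripped and capped at
     * MAX_LENGTH characters, or null when there is nothing worth reporting.
     */
    public static String firstLine(String raw) {
        if (raw == null) {
            return null;
        }
        // \R matches any line terminator, so Windows and Unix traces both split.
        for (String line : raw.split("\\R")) {
            String stripped = line.strip();
            if (!stripped.isEmpty()) {
                return stripped.length() <= MAX_LENGTH
                        ? stripped
                        : stripped.substring(0, MAX_LENGTH) + "...";
            }
        }
        return null;
    }
}
```

In `buildProcessFailureResponse`, the `message` field would then be added only when `SafeMessages.firstLine(result.message())` is non-null. Note that the first line of a stack trace still names the exception class, so omitting the message entirely remains the stricter option; the full trace can be logged server-side either way.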
##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type # single field
* `/translate/all/\{translator}/\{src}/\{dest}` — translation
* `/pipes`, `/async` — Pipes-based bulk processing
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:
+
+[source,json]
+----
+{"status": "TIMEOUT", "message": "Task timed out after 60000ms"}
+----
+
+The `status` field is the `PipesResult.RESULT_STATUS` enum name. The `message` field is
+present when Tika provided one, absent otherwise.
+
+[cols="1,1,3"]
+|===
+|HTTP status |`status` values |Meaning
+
+|`503 Service Unavailable`
+|`TIMEOUT`, `OOM`, `UNSPECIFIED_CRASH`, `CLIENT_UNAVAILABLE_WITHIN_MS`
+|The forked parse process failed, or no parse client became available within the
+configured wait time (`CLIENT_UNAVAILABLE_WITHIN_MS`). The server is still healthy;
+the client may retry.
+
+|`500 Internal Server Error`
+|`FAILED_TO_INITIALIZE`, `FETCH_EXCEPTION`, `EMIT_EXCEPTION`,
+`FETCHER_NOT_FOUND`, `EMITTER_NOT_FOUND`,
+`FETCHER_INITIALIZATION_EXCEPTION`, `EMITTER_INITIALIZATION_EXCEPTION`
+|Server misconfiguration or a task-level infrastructure error. Retrying the same
+document on the same server is unlikely to succeed without a configuration fix.
+|===
+
+NOTE: A successful parse that encountered internal parser errors (e.g. a truncated
+embedded document) still returns `200 OK`. The partial-parse exception is surfaced
+in the `X-TIKA:CONTAINER_EXCEPTION` metadata field of the response, not as an HTTP
+error code.
Review Comment:
The docs reference `X-TIKA:CONTAINER_EXCEPTION`, but the actual metadata key
used by `TikaCoreProperties.CONTAINER_EXCEPTION` is
`X-TIKA:EXCEPTION:container_exception`. Using the wrong field name will mislead
clients looking for partial-parse exceptions.
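For readers of the status table quoted above, a client-side retry decision might look like the following sketch. The class name `TikaErrorPolicy` and the method `mayRetry` are hypothetical and not part of Tika; the status names are copied from the 503 row of the table as proposed in this PR and may change before merge.

```java
import java.util.Set;

public final class TikaErrorPolicy {
    // Status names taken from the 503 row of the proposed docs table.
    private static final Set<String> RETRYABLE_STATUSES = Set.of(
            "TIMEOUT", "OOM", "UNSPECIFIED_CRASH", "CLIENT_UNAVAILABLE_WITHIN_MS");

    private TikaErrorPolicy() {
    }

    /**
     * Returns true when the HTTP status and the JSON "status" field together
     * indicate a failure that the proposed docs describe as retryable.
     */
    public static boolean mayRetry(int httpStatus, String status) {
        return httpStatus == 503
                && status != null
                && RETRYABLE_STATUSES.contains(status);
    }
}
```

A caller would parse the `status` field from the JSON error body with whatever JSON library it already uses, then pass it in together with the HTTP status code. Treating an unknown or missing status as non-retryable is a conservative choice made for this sketch, not something the docs prescribe.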
> Improve msg on oom/timeout in tika-server's /tika/json endpoint
> ---------------------------------------------------------------
>
> Key: TIKA-4753
> URL: https://issues.apache.org/jira/browse/TIKA-4753
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>