[ 
https://issues.apache.org/jira/browse/TIKA-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086411#comment-18086411
 ] 

ASF GitHub Bot commented on TIKA-4753:
--------------------------------------

Copilot commented on code in PR #2870:
URL: https://github.com/apache/tika/pull/2870#discussion_r3363745267


##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:

Review Comment:
   This section intro frames JSON error bodies as only applying to 
process-level failures (timeout/OOM/crash), but the table below (and the server 
behavior) also covers task/initialization failures that return `500` with the 
same JSON body. Widening the wording here will prevent readers from assuming 
non-process failures still return plain text.



##########
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/PipesParsingHelper.java:
##########
@@ -184,33 +187,56 @@ private String getSuffix(Metadata metadata) {
         return ".tmp";
     }
 
+    /**
+     * Builds a JSON error response carrying a subset of the {@code PipesResult}
+     * serialization — the {@code status} and, when present, a non-blank {@code message}:
+     * {@code {"status": "TIMEOUT", "message": "..."}}. Successful-parse fields such as
+     * {@code emitData} are never part of an error body.
+     * <p>
+     * This allows clients to distinguish failure modes (TIMEOUT, OOM, UNSPECIFIED_CRASH, …)
+     * without parsing plain-text bodies or inspecting custom headers.
+     */
+    private static Response buildProcessFailureResponse(PipesResult result) {
+        ObjectMapper mapper = new ObjectMapper();
+        ObjectNode node = mapper.createObjectNode();
+        node.put("status", result.status().name());
+        if (result.message() != null && !result.message().isBlank()) {
+            node.put("message", result.message());
+        }

Review Comment:
   `buildProcessFailureResponse` currently includes `result.message()` verbatim 
in the JSON response when non-blank. In practice, `PipesResult.message()` is 
often populated with `ExceptionUtils.getStackTrace(e)` (multi-line stack 
traces) for statuses like `FETCH_EXCEPTION`, `EMIT_EXCEPTION`, and 
`UNSPECIFIED_CRASH`, which can leak internal server details and can produce 
very large HTTP error bodies. Consider only including a short, single-line 
message (or omitting it entirely) to avoid exposing stack traces by default.
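
   One possible shape for the "short, single-line message" option is sketched 
below. This is only an illustration under stated assumptions: the class name 
`ErrorMessageSanitizer`, the method `summarize`, and the 200-character cap are 
hypothetical and do not come from the PR. The helper keeps the first non-blank 
line of the message (for a stack trace, typically the exception class and 
message) and drops the remaining frames.

```java
public final class ErrorMessageSanitizer {

    // Hypothetical cap on message length; not a value taken from the PR.
    static final int MAX_MESSAGE_LENGTH = 200;

    private ErrorMessageSanitizer() {
    }

    /**
     * Returns the first non-blank line of {@code message}, trimmed and capped at
     * {@code maxLength} characters, or {@code null} if there is nothing to report.
     */
    static String summarize(String message, int maxLength) {
        if (message == null || maxLength <= 0) {
            return null;
        }
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : message.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed.length() <= maxLength
                        ? trimmed
                        : trimmed.substring(0, maxLength);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String stackTrace = "java.io.IOException: fetch failed\n"
                + "\tat com.example.Fetcher.fetch(Fetcher.java:42)\n";
        // Prints only the first line: java.io.IOException: fetch failed
        System.out.println(summarize(stackTrace, MAX_MESSAGE_LENGTH));
    }
}
```

   With a helper like this, `buildProcessFailureResponse` could add the 
`message` field only when the summarized value is non-null, keeping the full 
stack trace in the server log rather than in the HTTP body.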



##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:
+
+[source,json]
+----
+{"status": "TIMEOUT", "message": "Task timed out after 60000ms"}
+----
+
+The `status` field is the `PipesResult.RESULT_STATUS` enum name. The `message` field is
+present when Tika provided one, absent otherwise.
+
+[cols="1,1,3"]
+|===
+|HTTP status |`status` values |Meaning
+
+|`503 Service Unavailable`
+|`TIMEOUT`, `OOM`, `UNSPECIFIED_CRASH`, `CLIENT_UNAVAILABLE_WITHIN_MS`
+|The forked parse process failed, or no parse client became available within the
+configured wait time (`CLIENT_UNAVAILABLE_WITHIN_MS`). The server is still healthy;
+the client may retry.
+
+|`500 Internal Server Error`
+|`FAILED_TO_INITIALIZE`, `FETCH_EXCEPTION`, `EMIT_EXCEPTION`,
+`FETCHER_NOT_FOUND`, `EMITTER_NOT_FOUND`,
+`FETCHER_INITIALIZATION_EXCEPTION`, `EMITTER_INITIALIZATION_EXCEPTION`
+|Server misconfiguration or a task-level infrastructure error. Retrying the same
+document on the same server is unlikely to succeed without a configuration fix.
+|===
+
+NOTE: A successful parse that encountered internal parser errors (e.g. a truncated
+embedded document) still returns `200 OK`. The partial-parse exception is surfaced
+in the `X-TIKA:CONTAINER_EXCEPTION` metadata field of the response, not as an HTTP
+error code.

Review Comment:
   The docs reference `X-TIKA:CONTAINER_EXCEPTION`, but the actual metadata key 
used by `TikaCoreProperties.CONTAINER_EXCEPTION` is 
`X-TIKA:EXCEPTION:container_exception`. Using the wrong field name will mislead 
clients looking for partial-parse exceptions.
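
   To illustrate why the exact key matters, here is a rough client-side sketch 
that follows the behavior described in the proposed docs: `503` is treated as 
retryable, and a `200 OK` response is checked for a partial-parse exception 
under the corrected key. The class `TikaErrorPolicy` and its methods are 
hypothetical, and the sketch assumes the client has already decoded the 
response metadata into a flat `Map<String, String>`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public final class TikaErrorPolicy {

    // Metadata key named in the review comment above.
    static final String CONTAINER_EXCEPTION_KEY = "X-TIKA:EXCEPTION:container_exception";

    private TikaErrorPolicy() {
    }

    /** Per the proposed docs table, only 503 responses are described as retryable. */
    static boolean isRetryable(int httpStatus) {
        return httpStatus == 503;
    }

    /** Returns the partial-parse exception text, if the metadata carries one. */
    static Optional<String> containerException(Map<String, String> metadata) {
        String value = metadata.get(CONTAINER_EXCEPTION_KEY);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value);
    }

    public static void main(String[] args) {
        // Hypothetical metadata from a 200 OK response with a partial-parse error.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTAINER_EXCEPTION_KEY, "java.io.EOFException: truncated stream");

        System.out.println(isRetryable(503));
        System.out.println(containerException(metadata).isPresent());
    }
}
```

   A client that looked up `X-TIKA:CONTAINER_EXCEPTION` instead would get an 
empty result here and silently miss the partial-parse error.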





> Improve msg on oom/timeout in tika-server's /tika/json endpoint
> ---------------------------------------------------------------
>
>                 Key: TIKA-4753
>                 URL: https://issues.apache.org/jira/browse/TIKA-4753
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
