Copilot commented on code in PR #2870:
URL: https://github.com/apache/tika/pull/2870#discussion_r3363745267


##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:

Review Comment:
   This section intro frames JSON error bodies as only applying to 
process-level failures (timeout/OOM/crash), but the table below (and the server 
behavior) also covers task/initialization failures that return `500` with the 
same JSON body. Widening the wording here will prevent readers from assuming 
non-process failures still return plain text.



##########
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/PipesParsingHelper.java:
##########
@@ -184,33 +187,56 @@ private String getSuffix(Metadata metadata) {
         return ".tmp";
     }
 
+    /**
+     * Builds a JSON error response carrying a subset of the {@code PipesResult}
+     * serialization — the {@code status} and, when present, a non-blank {@code message}:
+     * {@code {"status": "TIMEOUT", "message": "..."}}. Successful-parse fields such as
+     * {@code emitData} are never part of an error body.
+     * <p>
+     * This allows clients to distinguish failure modes (TIMEOUT, OOM, UNSPECIFIED_CRASH, …)
+     * without parsing plain-text bodies or inspecting custom headers.
+     */
+    private static Response buildProcessFailureResponse(PipesResult result) {
+        ObjectMapper mapper = new ObjectMapper();
+        ObjectNode node = mapper.createObjectNode();
+        node.put("status", result.status().name());
+        if (result.message() != null && !result.message().isBlank()) {
+            node.put("message", result.message());
+        }

Review Comment:
   `buildProcessFailureResponse` currently includes `result.message()` verbatim 
in the JSON response when non-blank. In practice, `PipesResult.message()` is 
often populated with `ExceptionUtils.getStackTrace(e)` (multi-line stack 
traces) for statuses like `FETCH_EXCEPTION`, `EMIT_EXCEPTION`, and 
`UNSPECIFIED_CRASH`, which can leak internal server details and can produce 
very large HTTP error bodies. Consider only including a short, single-line 
message (or omitting it entirely) to avoid exposing stack traces by default.
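One way to apply the suggestion above is to reduce the message to a short single line before it is written into the JSON body. The following is a minimal sketch, not the PR's implementation: the class `ErrorMessageSanitizer`, the method `toSafeMessage`, and the 200-character cap are hypothetical names and values chosen for illustration. Note that the first line of a stack trace still carries the exception class and its message, so omitting the field entirely remains the stricter option.

```java
public final class ErrorMessageSanitizer {

    // Hypothetical cap on message length; not a value taken from Tika.
    public static final int MAX_LENGTH = 200;

    private ErrorMessageSanitizer() {
    }

    /**
     * Returns the first non-blank line of the raw message, truncated to
     * MAX_LENGTH characters, or null when there is nothing to report.
     */
    public static String toSafeMessage(String raw) {
        if (raw == null || raw.isBlank()) {
            return null;
        }
        // strip() drops leading blank lines, so the first line is non-empty.
        String firstLine = raw.strip().lines().findFirst().orElse("");
        if (firstLine.length() > MAX_LENGTH) {
            firstLine = firstLine.substring(0, MAX_LENGTH) + "...";
        }
        return firstLine;
    }

    public static void main(String[] args) {
        String trace = "java.io.IOException: fetch failed\n"
                + "\tat org.example.Foo.bar(Foo.java:42)";
        // Only the first line of the trace survives.
        System.out.println(toSafeMessage(trace));
    }
}
```

With a helper like this, the builder would put the `message` field only when the sanitized value is non-null, and the full stack trace could still be logged server-side.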



##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -156,6 +156,43 @@ curl -T document.pdf http://localhost:9998/meta/Content-Type   # single field
 * `/translate/all/\{translator}/\{src}/\{dest}` — translation
 * `/pipes`, `/async` — Pipes-based bulk processing
 
+== Error Responses
+
+When parsing fails due to a process-level problem — the forked child process timed out,
+ran out of memory, or crashed unexpectedly — the server returns an HTTP error with a
+JSON body whose shape matches the `PipesResult` status:
+
+[source,json]
+----
+{"status": "TIMEOUT", "message": "Task timed out after 60000ms"}
+----
+
+The `status` field is the `PipesResult.RESULT_STATUS` enum name. The `message` field is
+present when Tika provided one, absent otherwise.
+
+[cols="1,1,3"]
+|===
+|HTTP status |`status` values |Meaning
+
+|`503 Service Unavailable`
+|`TIMEOUT`, `OOM`, `UNSPECIFIED_CRASH`, `CLIENT_UNAVAILABLE_WITHIN_MS`
+|The forked parse process failed, or no parse client became available within the
+configured wait time (`CLIENT_UNAVAILABLE_WITHIN_MS`). The server is still healthy;
+the client may retry.
+
+|`500 Internal Server Error`
+|`FAILED_TO_INITIALIZE`, `FETCH_EXCEPTION`, `EMIT_EXCEPTION`,
+`FETCHER_NOT_FOUND`, `EMITTER_NOT_FOUND`,
+`FETCHER_INITIALIZATION_EXCEPTION`, `EMITTER_INITIALIZATION_EXCEPTION`
+|Server misconfiguration or a task-level infrastructure error. Retrying the same
+document on the same server is unlikely to succeed without a configuration fix.
+|===
+
+NOTE: A successful parse that encountered internal parser errors (e.g. a truncated
+embedded document) still returns `200 OK`. The partial-parse exception is surfaced
+in the `X-TIKA:CONTAINER_EXCEPTION` metadata field of the response, not as an HTTP
+error code.

Review Comment:
   The docs reference `X-TIKA:CONTAINER_EXCEPTION`, but the actual metadata key 
used by `TikaCoreProperties.CONTAINER_EXCEPTION` is 
`X-TIKA:EXCEPTION:container_exception`. Using the wrong field name will mislead 
clients looking for partial-parse exceptions.
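To illustrate why the exact key matters, here is a small client-side sketch that looks up the partial-parse exception in response metadata. It assumes the metadata has already been parsed into a `Map<String, String>` (for example from a JSON metadata response); the class `PartialParseCheck` and the method `containerException` are hypothetical, and the key string is the one quoted in this comment rather than something verified here against the Tika sources.

```java
import java.util.Map;
import java.util.Optional;

public final class PartialParseCheck {

    // Key as quoted in the review comment above (assumed, not verified here).
    public static final String CONTAINER_EXCEPTION_KEY =
            "X-TIKA:EXCEPTION:container_exception";

    private PartialParseCheck() {
    }

    /** Returns the container exception text, if the metadata carries one. */
    public static Optional<String> containerException(Map<String, String> metadata) {
        return Optional.ofNullable(metadata.get(CONTAINER_EXCEPTION_KEY))
                .filter(value -> !value.isBlank());
    }

    public static void main(String[] args) {
        Map<String, String> metadata =
                Map.of(CONTAINER_EXCEPTION_KEY, "org.example.TruncatedStreamException");
        // A 200 OK response can still report a partial parse this way.
        System.out.println(containerException(metadata).isPresent());
    }
}
```

A client keyed on `X-TIKA:CONTAINER_EXCEPTION` instead would silently find nothing and treat a partial parse as a clean one.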



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.