nddipiazza opened a new pull request, #2811:
URL: https://github.com/apache/tika/pull/2811

   ## Summary
   
   Adds `TikaTypedResponse` as an **experimental, opt-in** alternative to the 
flat `map<string,string> fields` in `FetchAndParseReply`.  The existing 
`fields` map is unchanged — no breaking change for existing clients.
   
   Sparked by a conversation with Kristian Rickert 
([@krickert](https://github.com/krickert)) who built a comprehensive typed 
proto schema for Tika metadata at 
[ai-pipestream/pipestream-protos](https://github.com/ai-pipestream/pipestream-protos)
 and opened the discussion about whether tika-grpc should expose the same.
   
   ## Motivation
   
   Tika's internal metadata model is already strongly typed.  The gRPC layer 
currently serialises everything to strings:
   
   ```protobuf
   // before
   // e.g. "pdf:encrypted" -> "true", "xmpTPg:NPages" -> "3"
   map<string, string> fields = 2;
   ```
   
   That forces callers to:
   - Re-parse booleans, integers, and timestamps from strings
   - Handle repeated values that are squashed to a single string
   - Spend CPU cycles on avoidable serialisation in both directions
   
   As Kristian pointed out, this incurs essentially the same parsing overhead as 
JSON while giving up the type safety that protobuf is meant to provide.
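   
   To make that cost concrete, here is roughly what a client has to write today 
against the flat map. This is a minimal sketch, not code from this PR: the 
class and helper names (`FlatFieldParsing`, `readBoolean`, `readInt`, 
`readInstant`) are hypothetical, and the `dcterms:created` value is an assumed 
ISO-8601 timestamp used only for illustration.
   
   ```java
   import java.time.Instant;
   import java.time.format.DateTimeParseException;
   import java.util.Map;
   import java.util.Optional;
   import java.util.OptionalInt;

   // Hypothetical client-side helpers; nothing here is part of tika-grpc.
   class FlatFieldParsing {

       // Every boolean has to be recovered from its string form.
       static Optional<Boolean> readBoolean(Map<String, String> fields, String key) {
           String raw = fields.get(key);
           if (raw == null) {
               return Optional.empty();
           }
           if (raw.equalsIgnoreCase("true")) {
               return Optional.of(Boolean.TRUE);
           }
           if (raw.equalsIgnoreCase("false")) {
               return Optional.of(Boolean.FALSE);
           }
           return Optional.empty(); // unrecognised spelling
       }

       // Integers need a parse plus error handling for malformed input.
       static OptionalInt readInt(Map<String, String> fields, String key) {
           String raw = fields.get(key);
           if (raw == null) {
               return OptionalInt.empty();
           }
           try {
               return OptionalInt.of(Integer.parseInt(raw.trim()));
           } catch (NumberFormatException e) {
               return OptionalInt.empty();
           }
       }

       // Timestamps: assumes the server emitted an ISO-8601 instant.
       static Optional<Instant> readInstant(Map<String, String> fields, String key) {
           String raw = fields.get(key);
           if (raw == null) {
               return Optional.empty();
           }
           try {
               return Optional.of(Instant.parse(raw.trim()));
           } catch (DateTimeParseException e) {
               return Optional.empty();
           }
       }

       public static void main(String[] args) {
           // Illustrative values only.
           Map<String, String> fields = Map.of(
                   "pdf:encrypted", "true",
                   "xmpTPg:NPages", "3",
                   "dcterms:created", "2024-01-15T10:30:00Z");

           System.out.println(readBoolean(fields, "pdf:encrypted"));
           System.out.println(readInt(fields, "xmpTPg:NPages"));
           System.out.println(readInstant(fields, "dcterms:created"));
       }
   }
   ```
   
   With a typed response, each of these helpers would collapse into a plain 
generated getter, and the error branches for malformed strings would go away 
on the client side.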
   
   ## Changes
   
   | File | Description |
   |------|-------------|
   | `tika-grpc/src/main/proto/tika_typed_response.proto` | New proto: `TikaTypedResponse`, `DublinCoreMetadata`, `PdfTypedMetadata`, `OfficeTypedMetadata`, `ImageTypedMetadata`, `EmailTypedMetadata`, `MediaTypedMetadata`, `GenericTypedMetadata`, `TikaTypedParseStatus`, `TikaEmbeddedDocument` |
   | `tika-grpc/src/main/proto/tika.proto` | Add `TikaTypedResponse typed_response = 5` to `FetchAndParseReply` |
   | `tika-grpc/src/main/java/.../TikaTypedMetadataMapper.java` | Maps `List<Metadata>` → `TikaTypedResponse`; dispatches by Content-Type |
   | `tika-grpc/src/main/java/.../TikaGrpcServerImpl.java` | Wire mapper in `fetchAndParseImpl()` |
   
   ## Design
   
   ```protobuf
   message FetchAndParseReply {
     string fetch_key = 1;
     map<string, string> fields = 2;          // existing — unchanged
     string status = 3;
     string error_message = 4;
     TikaTypedResponse typed_response = 5;   // new (experimental)
   }
   
   message TikaTypedResponse {
     TikaTextContent content = 1;
     DublinCoreMetadata dublin_core = 2;
     oneof document_metadata {
       PdfTypedMetadata pdf = 3;
       OfficeTypedMetadata office = 4;
       ImageTypedMetadata image = 5;
       EmailTypedMetadata email = 6;
       MediaTypedMetadata media = 7;
       GenericTypedMetadata generic = 8;
     }
     TikaTypedParseStatus parse_status = 9;
     repeated TikaEmbeddedDocument embedded_documents = 10;
     map<string, string> overflow_fields = 11;   // unmapped fields
   }
   ```
   
   The `oneof document_metadata` branch is selected by the `Content-Type` of 
the primary metadata entry.  Any metadata key not handled by the typed branch 
lands in `overflow_fields` so callers never lose data.
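   
   The dispatch-plus-overflow idea can be sketched in plain Java as follows. 
This is an illustration under stated assumptions, not the actual 
`TikaTypedMetadataMapper`: the class name `TypedDispatchSketch`, the `Branch` 
enum, and the specific MIME-type rules are all hypothetical, and the real 
mapper may use different rules.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Locale;
   import java.util.Map;
   import java.util.Set;

   // Hypothetical sketch; does not use the generated proto classes.
   class TypedDispatchSketch {

       // Mirrors the branches of `oneof document_metadata`.
       enum Branch { PDF, OFFICE, IMAGE, EMAIL, MEDIA, GENERIC }

       // Pick a branch from a Content-Type value. The rules are assumptions.
       static Branch selectBranch(String contentType) {
           if (contentType == null || contentType.trim().isEmpty()) {
               return Branch.GENERIC; // missing metadata falls back to generic
           }
           // Drop parameters such as "; charset=UTF-8" before matching.
           String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
           if (mime.equals("application/pdf")) {
               return Branch.PDF;
           }
           if (mime.startsWith("image/")) {
               return Branch.IMAGE;
           }
           if (mime.startsWith("audio/") || mime.startsWith("video/")) {
               return Branch.MEDIA;
           }
           // Checked before the Office prefixes so Outlook .msg is not misrouted.
           if (mime.equals("message/rfc822") || mime.equals("application/vnd.ms-outlook")) {
               return Branch.EMAIL;
           }
           if (mime.equals("application/msword")
                   || mime.startsWith("application/vnd.ms-")
                   || mime.startsWith("application/vnd.openxmlformats-officedocument")) {
               return Branch.OFFICE;
           }
           return Branch.GENERIC;
       }

       // Everything the typed branch did not consume goes to overflow.
       static Map<String, String> overflow(Map<String, String> all, Set<String> handledKeys) {
           Map<String, String> out = new LinkedHashMap<>();
           for (Map.Entry<String, String> e : all.entrySet()) {
               if (!handledKeys.contains(e.getKey())) {
                   out.put(e.getKey(), e.getValue());
               }
           }
           return out;
       }

       public static void main(String[] args) {
           Map<String, String> metadata = new LinkedHashMap<>();
           metadata.put("Content-Type", "application/pdf");
           metadata.put("pdf:encrypted", "true");
           metadata.put("custom:unmapped", "kept"); // illustrative key

           System.out.println(selectBranch(metadata.get("Content-Type")));
           System.out.println(overflow(metadata, Set.of("Content-Type", "pdf:encrypted")));
       }
   }
   ```
   
   Tracking the set of handled keys explicitly is one way to guarantee that 
the typed branch and `overflow_fields` together always cover the full metadata 
set.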
   
   ## Review Focus Areas
   
   - **Proto field coverage** — are there important Tika metadata fields 
missing from each typed message?
   - **oneof vs. separate messages** — is the `oneof document_metadata` 
approach the right shape, or should we use a different extension strategy?
   - **Naming conventions** — field names follow Kristian's reference design; 
are they consistent with existing Tika naming?
   - **Mapper correctness** — Tika metadata key strings are in 
`TikaTypedMetadataMapper`; spot-check against actual `Metadata` output for your 
document types
   - **Experimental gate** — should population of `typed_response` be gated on 
a request flag rather than always-on?
   
   ## Critical Files
   
   - `tika-grpc/src/main/proto/tika_typed_response.proto`
   - `tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaTypedMetadataMapper.java`
   
   ## Testing Instructions
   
   1. Start the gRPC server with any fetcher config
   2. Call `FetchAndParse` on a PDF — `reply.typed_response.pdf` should contain 
typed fields
   3. Call on an Office document — `reply.typed_response.office` should have 
word_count, page_count, etc.
   4. Verify `reply.fields` still contains the full flat map (no regression)
   5. The existing e2e tests in `tika-e2e-tests` cover the base behaviour; there 
are no typed-response-specific tests yet (intentional for this draft)
   
   ## Review Checklist
   
   - [ ] Proto backwards compatibility (field 5 is a new, previously unused 
number; older clients skip unknown fields, so this is OK per proto3 rules)
   - [ ] No change to existing `fields` map population
   - [ ] `TikaTypedMetadataMapper` handles null / missing metadata gracefully
   - [ ] Content-Type dispatch covers common document families
   
   ## Potential Concerns
   
   - **Maintenance burden**: typed fields need updating if Tika adds new 
metadata keys.  The `overflow_fields` map ensures no data loss in the meantime.
   - **Field count**: Kristian's full schema maps ~1500 fields; this PR covers 
the most common families.  PRs for additional type branches (HTML, archive, 
font, WARC, etc.) can follow.
   - **Proto stability**: marking as experimental allows iteration on field 
names/numbers before the schema is frozen.
   
   Credit: [Kristian Rickert](https://github.com/krickert) — 
[pipestream-protos](https://github.com/ai-pipestream/pipestream-protos) served 
as the reference design.

