nddipiazza opened a new pull request, #2811: URL: https://github.com/apache/tika/pull/2811
## Summary

Adds `TikaTypedResponse` as an **experimental, opt-in** alternative to the flat `map<string,string> fields` in `FetchAndParseReply`. The existing `fields` map is unchanged, so there is no breaking change for existing clients.

Sparked by a conversation with Kristian Rickert ([@krickert](https://github.com/krickert)), who built a comprehensive typed proto schema for Tika metadata at [ai-pipestream/pipestream-protos](https://github.com/ai-pipestream/pipestream-protos) and opened the discussion about whether tika-grpc should expose the same.

## Motivation

Tika's internal metadata model is already strongly typed. The gRPC layer currently serialises everything to strings:

```
// before
map<string, string> fields = 2;  // "pdf:encrypted" -> "true", "xmpTPg:NPages" -> "3"
```

That forces callers to:

- Re-parse booleans, integers, and timestamps from strings
- Handle repeated values that are squashed to a single string
- Spend CPU cycles on avoidable serialisation in both directions

As Kristian pointed out, this gives you essentially the same overhead as JSON, with none of the type-safety benefits of protobuf.

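To make the cost concrete, here is a minimal sketch of what a client has to do with the flat map today. The class and method names (`FlatFieldsClient`, `isEncrypted`, `pageCount`) are hypothetical and not part of tika-grpc; only the two keys and their example values are taken from the snippet above.

```java
import java.util.Map;
import java.util.Optional;

public class FlatFieldsClient {

    // Hypothetical helper: recover a boolean from the string-typed map.
    static Optional<Boolean> isEncrypted(Map<String, String> fields) {
        String raw = fields.get("pdf:encrypted");
        return raw == null ? Optional.empty() : Optional.of(Boolean.parseBoolean(raw));
    }

    // Hypothetical helper: recover an integer, tolerating malformed values.
    static Optional<Integer> pageCount(Map<String, String> fields) {
        String raw = fields.get("xmpTPg:NPages");
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in for reply.getFieldsMap() on a FetchAndParseReply.
        Map<String, String> fields = Map.of("pdf:encrypted", "true", "xmpTPg:NPages", "3");
        System.out.println(isEncrypted(fields).orElse(false));
        System.out.println(pageCount(fields).orElse(-1));
    }
}
```

Every consumer ends up writing and maintaining helpers like these, each with its own null and parse-failure handling; a typed message would move that work into the schema.
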
## Changes

| File | Description |
|------|-------------|
| `tika-grpc/src/main/proto/tika_typed_response.proto` | New proto: `TikaTypedResponse`, `DublinCoreMetadata`, `PdfTypedMetadata`, `OfficeTypedMetadata`, `ImageTypedMetadata`, `EmailTypedMetadata`, `MediaTypedMetadata`, `GenericTypedMetadata`, `TikaTypedParseStatus`, `TikaEmbeddedDocument` |
| `tika-grpc/src/main/proto/tika.proto` | Add `TikaTypedResponse typed_response = 5` to `FetchAndParseReply` |
| `tika-grpc/src/main/java/.../TikaTypedMetadataMapper.java` | Maps `List<Metadata>` → `TikaTypedResponse`; dispatch by Content-Type |
| `tika-grpc/src/main/java/.../TikaGrpcServerImpl.java` | Wire mapper in `fetchAndParseImpl()` |

## Design

```protobuf
message FetchAndParseReply {
  string fetch_key = 1;
  map<string, string> fields = 2;            // existing, unchanged
  string status = 3;
  string error_message = 4;
  TikaTypedResponse typed_response = 5;      // new (experimental)
}

message TikaTypedResponse {
  TikaTextContent content = 1;
  DublinCoreMetadata dublin_core = 2;
  oneof document_metadata {
    PdfTypedMetadata pdf = 3;
    OfficeTypedMetadata office = 4;
    ImageTypedMetadata image = 5;
    EmailTypedMetadata email = 6;
    MediaTypedMetadata media = 7;
    GenericTypedMetadata generic = 8;
  }
  TikaTypedParseStatus parse_status = 9;
  repeated TikaEmbeddedDocument embedded_documents = 10;
  map<string, string> overflow_fields = 11;  // unmapped fields
}
```

The `oneof document_metadata` branch is selected by the `Content-Type` of the primary metadata entry. Any metadata key not handled by the typed branch lands in `overflow_fields`, so callers never lose data.

## Review Focus Areas

- **Proto field coverage**: are there important Tika metadata fields missing from each typed message?
- **oneof vs. separate messages**: is the `oneof document_metadata` approach the right shape, or should we use a different extension strategy?
- **Naming conventions**: field names follow Kristian's reference design; are they consistent with existing Tika naming?
- **Mapper correctness**: Tika metadata key strings are in `TikaTypedMetadataMapper`; spot-check against actual `Metadata` output for your document types
- **Experimental gate**: should population of `typed_response` be gated on a request flag rather than always-on?

## Critical Files

- `tika-grpc/src/main/proto/tika_typed_response.proto`
- `tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaTypedMetadataMapper.java`

## Testing Instructions

1. Start the gRPC server with any fetcher config
2. Call `FetchAndParse` on a PDF: `reply.typed_response.pdf` should contain typed fields
3. Call on an Office document: `reply.typed_response.office` should have `word_count`, `page_count`, etc.
4. Verify `reply.fields` still contains the full flat map (no regression)
5. The existing e2e tests in `tika-e2e-tests` cover the base behaviour; no typed-response-specific tests yet (intentional for this draft)

## Review Checklist

- [ ] Proto backwards compatibility (new optional field 5; OK per proto3 rules)
- [ ] No change to existing `fields` map population
- [ ] `TikaTypedMetadataMapper` handles null / missing metadata gracefully
- [ ] Content-Type dispatch covers common document families

## Potential Concerns

- **Maintenance burden**: typed fields need updating if Tika adds new metadata keys. The `overflow_fields` map ensures no data loss in the meantime.
- **Field count**: Kristian's full schema maps ~1500 fields; this PR covers the most common families. PRs for additional type branches (HTML, archive, font, WARC, etc.) can follow.
- **Proto stability**: marking as experimental allows iteration on field names/numbers before the schema is frozen.

Credit: [Kristian Rickert](https://github.com/krickert). [pipestream-protos](https://github.com/ai-pipestream/pipestream-protos) served as the reference design.

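For reviewers looking at the Content-Type dispatch and the `overflow_fields` behaviour described under Design, the idea can be sketched roughly as below. This is an illustration only: the class `TypedDispatchSketch`, the `Branch` enum, and the MIME prefix rules are assumptions made for the example, and the actual rules in `TikaTypedMetadataMapper` may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TypedDispatchSketch {

    // Mirrors the members of oneof document_metadata in the Design section.
    enum Branch { PDF, OFFICE, IMAGE, EMAIL, MEDIA, GENERIC }

    // Assumed matching rules; the real mapper may dispatch differently.
    static Branch select(String contentType) {
        if (contentType == null) {
            return Branch.GENERIC;
        }
        // Drop parameters such as "; charset=UTF-8" before matching.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (base.equals("application/pdf")) return Branch.PDF;
        if (base.startsWith("image/")) return Branch.IMAGE;
        if (base.startsWith("audio/") || base.startsWith("video/")) return Branch.MEDIA;
        if (base.equals("message/rfc822")) return Branch.EMAIL;
        if (base.equals("application/msword")
                || base.startsWith("application/vnd.ms-")
                || base.startsWith("application/vnd.openxmlformats-officedocument.")) {
            return Branch.OFFICE;
        }
        return Branch.GENERIC;
    }

    // Keys the typed branch did not consume are copied to the overflow map.
    static Map<String, String> overflow(Map<String, String> all, Set<String> handled) {
        Map<String, String> rest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!handled.contains(e.getKey())) {
                rest.put(e.getKey(), e.getValue());
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Content-Type", "application/pdf");
        fields.put("pdf:encrypted", "true");
        fields.put("custom:key", "value"); // hypothetical unmapped key
        System.out.println(select(fields.get("Content-Type")));
        System.out.println(overflow(fields, Set.of("Content-Type", "pdf:encrypted")));
    }
}
```

Falling back to a generic branch for unknown or missing content types, and copying unconsumed keys rather than dropping them, is what keeps the typed view lossless relative to the flat `fields` map.
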
