Kristian Rickert created TIKA-4771:
--------------------------------------

             Summary: Pluggable external parsers over gRPC: attach plugin-owned 
results to Document via a typed Any envelope
                 Key: TIKA-4771
                 URL: https://issues.apache.org/jira/browse/TIKA-4771
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Kristian Rickert


Follow-up to TIKA-4766 (PR #2921). Design proposal – feedback wanted on the 
open questions below before code.
h3. Problem

The typed {{Document}} contract deliberately does not model format- or 
domain-specific result shapes (a document-layout model's tree, NLP annotations, 
embeddings). Downstream projects want to attach such results to a Tika parse 
without Tika ever having to model, depend on, or even load their types.
h3. Proposal

A third party implements a small {{ExternalParser}} gRPC service and registers 
it with the Tika gRPC server. For each document whose content type matches the 
registration, the server calls the plugin and appends its result to the 
{{{}Document{}}}. The result rides in an envelope that wraps a 
{{{}google.protobuf.Any{}}}:
{code:java}
message ExtensionResult {
  google.protobuf.Any payload = 1;   // plugin-owned message type; Tika never models it
  string plugin_id = 2;              // which registration produced this
  repeated string warnings = 3;
  int64 call_time_ms = 4;
  // future (additive, non-breaking): schema/descriptor reference, links into
  // Document blocks, common cross-plugin metadata
}

// on the Document from TIKA-4766:
repeated ExtensionResult extensions = 40;
{code}
Wrapping the {{Any}} (rather than a bare {{{}repeated Any{}}}) means envelope 
metadata can be added later as {{optional}} fields without another contract 
change.

This extends a pattern tika-grpc already has: the server currently brokers 
registered fetchers, emitters, and pipes iterators; this applies the same 
registration-and-routing model to parse-time enrichment.
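A minimal sketch of the registration-and-routing idea, in plain Java with no gRPC dependency. The names {{ExternalParserRegistry}} and {{Registration}} and the {{host:port}} target string are hypothetical and not part of tika-grpc. Exact matching on the base content type is also an assumption, since the proposal does not specify matching rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: maps content types to registered plugin targets.
final class ExternalParserRegistry {

    // One registration: which plugin, where it listens, which types it handles.
    record Registration(String pluginId, String target, List<String> contentTypes) {}

    private final Map<String, List<Registration>> byContentType = new ConcurrentHashMap<>();

    void register(Registration r) {
        for (String type : r.contentTypes()) {
            byContentType
                    .computeIfAbsent(normalize(type), k -> new CopyOnWriteArrayList<>())
                    .add(r);
        }
    }

    void unregister(String pluginId) {
        for (List<Registration> list : byContentType.values()) {
            list.removeIf(r -> r.pluginId().equals(pluginId));
        }
    }

    // Returns a snapshot of the plugins to call for this content type.
    List<Registration> match(String contentType) {
        return new ArrayList<>(byContentType.getOrDefault(normalize(contentType), List.of()));
    }

    // Assumption: parameters such as "; charset=..." are ignored when matching.
    private static String normalize(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

In this sketch the server would call {{match(contentType)}} once per parsed document and invoke each returned target in turn.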
h3. Sync and streaming plugin modes

This is where gRPC can shine and speed up indexing pipelines.

The {{ExternalParser}} service offers {{rpc Parse(...) returns 
(ExternalParseReply)}} (unary, required) and {{rpc ParseStream(...) returns 
(stream ExternalParseReply)}} (optional). The registration record declares 
which modes the plugin supports; plugin authors pick their programming model 
and Tika folds either into {{{}ExtensionResult{}}}(s). Under the document event 
stream (separate follow-up proposal, ticket forthcoming), streamed plugin 
results are forwarded to clients as extension events the moment they arrive, 
rather than after the parse completes.
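One way to picture folding both modes into results, as a plain-Java sketch without gRPC. {{PluginModes}}, {{Plugin}}, {{Reply}} and {{Result}} are hypothetical stand-ins for the generated stub, {{ExternalParseReply}} and {{ExtensionResult}}. For brevity the plugin itself reports whether it streams, whereas the proposal puts that declaration on the registration record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class PluginModes {

    // Stand-ins for ExternalParseReply and ExtensionResult; payload is opaque bytes.
    record Reply(byte[] payload) {}
    record Result(String pluginId, byte[] payload) {}

    // Hypothetical client-side view of the plugin service.
    interface Plugin {
        Reply parse(byte[] content);                       // unary, required

        default boolean supportsStreaming() { return false; }

        default Stream<Reply> parseStream(byte[] content) { // optional
            throw new UnsupportedOperationException("streaming not declared");
        }
    }

    // Folds either mode into a list of results; onResult fires as each reply is consumed.
    static List<Result> fold(String pluginId, Plugin plugin, byte[] content,
                             Consumer<Result> onResult) {
        List<Result> results = new ArrayList<>();
        Stream<Reply> replies = plugin.supportsStreaming()
                ? plugin.parseStream(content)
                : Stream.of(plugin.parse(content));
        try (replies) {
            replies.forEach(reply -> {
                Result r = new Result(pluginId, reply.payload());
                results.add(r);
                onResult.accept(r); // e.g. forward as an extension event
            });
        }
        return results;
    }
}
```

A unary plugin yields exactly one result here, while a streaming plugin yields one per reply, and the {{onResult}} callback is where an event stream could forward each result before the parse completes.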
h3. Type safety without coupling

Think of {{Any}} as roughly a Java class with two members: 
(Object payload, Class<T> clazz).

{{Any}} is lazy: the payload stays bytes until a consumer that has the plugin's 
generated class unpacks it.
{code:java}
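ExtensionResult r = document.getExtensions(0);
// is() compares only the type URL; no payload bytes are parsed here
if (r.getPayload().is(DoclingDocument.class)) {
    // unpack() throws InvalidProtocolBufferException if the bytes are malformed
    DoclingDocument doc = r.getPayload().unpack(DoclingDocument.class);  // T extends Message
}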
{code}
Tika itself never links against plugin classes. Java, Python, Go, and Rust 
clients each unpack with stubs generated from the plugin's proto. This spares 
a broker from deserializing and reserializing payloads it only passes 
through.
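The (Object payload, Class<T> clazz) analogy can be made concrete with a small plain-Java stand-in. This is only an illustration of the lazy-unpack idea, not the protobuf implementation: {{LazyEnvelope}} and the {{example.Note}} type name are hypothetical, and the decoder function takes the place of a generated message parser.

```java
import java.util.Optional;
import java.util.function.Function;

// Plain-Java analogy for google.protobuf.Any: a type name plus undecoded bytes.
final class LazyEnvelope {
    private final String typeUrl;
    private final byte[] value;

    LazyEnvelope(String typeUrl, byte[] value) {
        this.typeUrl = typeUrl;
        this.value = value.clone();
    }

    // Cheap check: compares type names only, never touches the payload bytes.
    boolean is(String expectedTypeUrl) {
        return typeUrl.equals(expectedTypeUrl);
    }

    // Decoding happens only here, and only for a consumer that knows the type.
    <T> Optional<T> unpack(String expectedTypeUrl, Function<byte[], T> decoder) {
        return is(expectedTypeUrl)
                ? Optional.of(decoder.apply(value.clone()))
                : Optional.empty();
    }
}
```

A broker holding a {{LazyEnvelope}} can pass it along without ever supplying a decoder, which is the property the proposal relies on.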
h3. Rendering without compiled classes (descriptors)

For JSON transcoding or debugging of a payload whose class is not on the 
classpath, descriptors are required. Proposed path: the registration record may 
optionally carry a serialized {{FileDescriptorSet}} (tika-grpc-api already 
bundles its own descriptors this way), enabling {{DynamicMessage}} + 
{{JsonFormat.TypeRegistry}} rendering. A full schema-registry integration can 
come later as an additive feature; it is not a prerequisite.
h3. Security

Registration makes the server call an arbitrary network address on every future 
matching parse. It must be gated exactly like fetcher/iterator management 
({{{}allowComponentManagement{}}}, disabled by default), TLS-capable, and a 
failing or unreachable plugin must never fail the parse (log + envelope warning 
instead).
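The "never fail the parse" rule could look roughly like the following plain-Java sketch. {{GuardedPluginCall}} and {{Outcome}} are hypothetical names, and the {{Supplier}} stands in for the actual plugin RPC. A real gRPC client would more likely set the deadline on the stub itself, for example with {{withDeadlineAfter}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class GuardedPluginCall {

    // Mirrors the envelope: payload may be absent, warnings explain why.
    record Outcome(byte[] payload, List<String> warnings, long callTimeMs) {}

    // Runs the plugin call with a deadline; every failure becomes a warning.
    static Outcome call(String pluginId, Supplier<byte[]> pluginCall, long deadlineMs) {
        long start = System.nanoTime();
        List<String> warnings = new ArrayList<>();
        byte[] payload = null;
        CompletableFuture<byte[]> future = CompletableFuture.supplyAsync(pluginCall);
        try {
            payload = future.get(deadlineMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future cancelled; it does not interrupt the running task.
            future.cancel(true);
            warnings.add(pluginId + ": deadline of " + deadlineMs + " ms exceeded");
        } catch (ExecutionException e) {
            warnings.add(pluginId + ": call failed: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            warnings.add(pluginId + ": interrupted while waiting");
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Outcome(payload, warnings, elapsedMs);
    }
}
```

The caller always gets an {{Outcome}} back, so the parse proceeds whether or not the plugin answered, and the warnings and elapsed time map onto the envelope's {{warnings}} and {{call_time_ms}} fields.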
h3. Open questions (input wanted)
 # Registration model: runtime RPCs 
({{{}Register/Get/List/UnregisterExternalParser{}}}), static tika-config 
entries, or both?
 # Plugin input: content bytes + content type + the parsed {{{}Document{}}}, so 
plugins never re-fetch or re-parse? Passing the full {{Document}} is the 
current lean.
 # Channel management: pooled/cached channels per registered target, with a 
call deadline?
 # Descriptor transport: inline {{FileDescriptorSet}} on the registration vs. 
external registry vs. defer entirely?

h3. Status

A working prototype (service proto, registry, registration RPCs, tests) exists 
on a branch and will be reshaped to this envelope design after PR #2921 
settles, so the wire contract tracks the final {{Document}} shape. A demo 
consumer is planned on the OpenNLP side (OPENNLP-1833): Tika parse -> typed 
Document -> NLP annotations/embeddings attached as an {{{}ExtensionResult{}}}.
h3. Opinionated side-note

A new parser ships in some other language every other week, and the pace is 
accelerating. This proposal lets Tika ride that wave instead of chasing it: 
each engine owns its result type end-to-end, and Tika orchestrates through the 
common {{Document}} and envelope metadata. It's very much in the spirit of the 
Pipes design – the same registration-and-routing idea that made fetchers and 
emitters pluggable, extended to parse output. I think this makes integrations 
dramatically easier and opens Tika up to parsing capability it would never want 
to carry natively – as witnessed by my initial design.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
