Kristian Rickert created TIKA-4772:
--------------------------------------

             Summary: Stream parsed-document events over gRPC; unary Document 
becomes the fold of the stream
                 Key: TIKA-4772
                 URL: https://issues.apache.org/jira/browse/TIKA-4772
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Kristian Rickert


h3. Motivation

Tika's parsers are already incremental: every parser emits its output 
progressively through the {{ContentHandler}} contract as it reads the input. 
The buffered single-reply gRPC response is an artifact of the transport, not of 
Tika's model. Exposing the parse as an event stream (a) matches what the 
parsers already do, (b) lowers time-to-first-block latency for RAG/embedding/NLP 
pipelines that can begin work while a large document is still parsing, (c) 
bounds client memory on huge documents, and (d) makes partial results on 
failure precise instead of best-effort.
h3. Design: one contract, two delivery modes
{code:java}
rpc FetchAndParseEvents(FetchAndParseRequest) returns (stream DocumentEvent) {}

message DocumentEvent {
  oneof event {
    DocumentEnvelope envelope = 1;   // first event: id, content_type, origin
    BlockEvent block = 2;            // one content Block, in document order
    DocumentMetadata metadata = 3;   // emitted when known (often late in the parse)
    MetadataField extra = 4;
    ExtensionResult extension = 5;   // external parser results, forwarded as they arrive
    ParseStatus status = 6;          // terminal event
  }
}
message BlockEvent {
  repeated int32 embedded_path = 1;  // [] = root document; [2] = third embedded child
  Block block = 2;
}
{code}
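As an illustration of the {{embedded_path}} convention, here is a minimal Java sketch that resolves a path against an in-memory document tree. {{DocNode}} and {{resolve}} are hypothetical names, not part of the proposed contract, and creating missing children on demand is an assumption about how a client might assemble the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory node; the real types come from the TIKA-4766 contract.
final class DocNode {
    final List<String> blocks = new ArrayList<>();
    final List<DocNode> children = new ArrayList<>();

    // Walks embedded_path from the root. [] addresses the root itself;
    // [2] addresses the third embedded child. Missing children are created
    // on the way (an assumption made for this sketch).
    DocNode resolve(List<Integer> embeddedPath) {
        DocNode node = this;
        for (int index : embeddedPath) {
            while (node.children.size() <= index) {
                node.children.add(new DocNode());
            }
            node = node.children.get(index);
        }
        return node;
    }
}

final class EmbeddedPathDemo {
    public static void main(String[] args) {
        DocNode root = new DocNode();
        root.resolve(List.of()).blocks.add("root paragraph");
        root.resolve(List.of(2)).blocks.add("attachment paragraph");
        System.out.println(root.children.size());                // 3
        System.out.println(root.children.get(2).blocks.get(0));  // attachment paragraph
    }
}
```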
Events are append-only immutable facts (a late {{metadata}} event completes, 
never revises). The defining invariant: *The fold of the event stream is 
exactly the {{Document}} from TIKA-4766.* Backward compatibility falls out by 
construction rather than by discipline: the existing unary-style reply is 
implemented AS the fold – the server consumes its own event stream to 
completion and emits the assembled {{Document}}. There is one parse path, so 
the streamed and materialized forms cannot drift, and existing clients see 
identical behavior forever. A mid-parse failure terminates the stream with 
{{PARTIAL}}, and the fold yields a {{Document}} holding exactly the blocks parsed 
before the failure – {{PARTIAL}} becomes a precise statement instead of a 
best-effort flag.
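
The fold invariant can be sketched in a few lines of Java. The {{Event}} and {{Document}} records below are simplified, hypothetical stand-ins for the protobuf messages (the real {{DocumentEvent}} is a oneof, and {{Document}} is defined in TIKA-4766); the point is only that each event adds to the accumulator and none rewrites it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-ins for the protobuf messages.
final class EventFold {
    enum Kind { ENVELOPE, BLOCK, METADATA, STATUS }

    record Event(Kind kind, String key, String value) {
        static Event envelope(String id) { return new Event(Kind.ENVELOPE, "id", id); }
        static Event block(String text) { return new Event(Kind.BLOCK, null, text); }
        static Event metadata(String k, String v) { return new Event(Kind.METADATA, k, v); }
        static Event status(String s) { return new Event(Kind.STATUS, null, s); }
    }

    record Document(String id, List<String> blocks, Map<String, String> metadata, String status) {}

    // Left fold over the stream: events are append-only facts.
    static Document fold(Iterable<Event> events) {
        String id = null;
        String status = null;
        List<String> blocks = new ArrayList<>();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Event e : events) {
            switch (e.kind()) {
                case ENVELOPE -> id = e.value();
                case BLOCK -> blocks.add(e.value());
                // A late metadata event completes the document, never revises it.
                case METADATA -> metadata.putIfAbsent(e.key(), e.value());
                case STATUS -> status = e.value(); // terminal event
            }
        }
        return new Document(id, List.copyOf(blocks), Map.copyOf(metadata), status);
    }

    public static void main(String[] args) {
        // A parse that fails after two blocks: the stream ends with PARTIAL.
        Document doc = fold(List.of(
                Event.envelope("doc-1"),
                Event.block("first"),
                Event.block("second"),
                Event.status("PARTIAL")));
        System.out.println(doc.blocks() + " " + doc.status()); // [first, second] PARTIAL
    }
}
```

Under this reading, the unary-style reply is simply {{fold}} applied to the server's own stream, so a {{PARTIAL}} result holds exactly the blocks that were emitted before the failure.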
h3. The event model is the wire's native model

{{DocumentEvent}}/{{Block}} is the contract's native representation. 
Tika's existing parsers keep their {{ContentHandler}} contract exactly as-is: a 
bridge converts their output into document events at the edge, and a reverse 
adapter serves anything that consumes {{ContentHandler}}s today – nothing about 
the Java parser API changes, now or later. The same event stream can equally be produced 
directly, by server components or (via the external parser mechanism, 
TIKA-4771) by services in any language, so every producer is first-class and 
identical on the wire. Structure the block model cannot express maps to 
{{HtmlBlock}}/metadata, never silently dropped.
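
A minimal sketch of what the {{ContentHandler}} bridge could look like, using only the SAX classes shipped with the JDK. {{BlockBridge}} is a hypothetical name, and emitting one block per {{p}} element is an assumption made for illustration; the real block granularity is defined by TIKA-4766, not here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical bridge: turns SAX callbacks into block events as they happen.
final class BlockBridge extends DefaultHandler {
    private final Consumer<String> sink;   // receives one text block per paragraph
    private StringBuilder current;         // non-null only while inside a <p>

    BlockBridge(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            sink.accept(current.toString());   // emit as soon as the paragraph closes
            current = null;
        }
    }
}

final class BlockBridgeDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        BlockBridge bridge = new BlockBridge(events::add);
        // Drive the handler by hand, the way a parser would during a parse.
        for (String text : List.of("first", "second")) {
            bridge.startElement("", "p", "p", new AttributesImpl());
            bridge.characters(text.toCharArray(), 0, text.length());
            bridge.endElement("", "p", "p");
        }
        System.out.println(events); // [first, second]
    }
}
```

Because the bridge is itself a {{ContentHandler}}, a parser needs no changes to feed it; the sink decides whether blocks go onto a gRPC stream or into a fold.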

This extends a pattern tika-grpc already has: the server currently brokers 
registered fetchers, emitters, and pipes iterators; this applies the same 
brokering to parse output.
h3. Phasing
 # *Phase 1 (contract + cheap wins):* declare the RPC; serve a coarse but valid 
stream (envelope, then blocks, then metadata, then terminal status); implement 
the unary-style reply as the fold. Streaming results from external parsers 
({{ParseStream}} in TIKA-4771) also land here – the gRPC server calls those 
services directly, so forwarding their streams touches nothing outside 
tika-grpc.
 # *Phase 2 (fine granularity):* frame {{DocumentEvent}} across the tika-pipes 
forked-worker boundary so events flow live during the parse. The worker 
protocol speaks {{DocumentEvent}} frames; the existing parser fleet feeds it 
through the {{ContentHandler}} bridge at the edge. This touches tika-pipes-core 
and deserves its own discussion – maintainer input is explicitly wanted on 
whether/how the worker framing should support incremental events. Because unary 
== fold, improving granularity later changes neither the contract nor unary 
behavior.
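
The Phase 1 coarse stream can be sketched as a function from an already materialized parse result to an ordered event list. The {{Event}} record and {{toEvents}} are hypothetical simplifications, not the generated protobuf classes; the sketch only demonstrates the ordering stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; the real messages are DocumentEvent and Document.
final class CoarseStream {
    record Event(String kind, Object payload) {}

    // Phase 1 ordering: envelope, then blocks, then metadata, then terminal status.
    static List<Event> toEvents(String id, List<String> blocks,
                                Map<String, String> metadata, String status) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("envelope", id));
        for (String block : blocks) {
            events.add(new Event("block", block));   // document order is preserved
        }
        events.add(new Event("metadata", Map.copyOf(metadata)));
        events.add(new Event("status", status));     // always last
        return events;
    }

    public static void main(String[] args) {
        // Example: a parse that failed after two blocks.
        List<Event> events = toEvents("doc-1", List.of("first", "second"),
                Map.of("title", "Example"), "PARTIAL");
        System.out.println(events.stream().map(Event::kind).toList());
        // [envelope, block, block, metadata, status]
    }
}
```

A Phase 2 server would emit the same event kinds, only earlier and in finer steps, which is why the contract does not need to change.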

h3. Open questions
 # Pipes worker protocol: appetite and preferred shape for incremental event 
framing across the fork boundary (this is the phase-2 gate)?
 # Ordering/assembly: block order per {{embedded_path}} is document order; is 
that sufficient, or do we want explicit sequence numbers for resumability?
 # Flow control: rely on gRPC per-stream backpressure, or add a server-side 
high-water mark with a documented buffering policy?
 # Metadata timing: emit known-early fields (content type, origin) in the 
envelope and the rest at completion, or allow multiple cumulative {{metadata}} 
events?
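
To make question 2 concrete, here is a sketch of what explicit sequence numbers would buy a resuming client. It is purely hypothetical: {{DocumentEvent}} has no sequence field in the proposal above, and {{SequencedEvent}} and {{ResumableConsumer}} are invented names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: illustrates duplicate suppression after a stream is resumed.
final class ResumableConsumer {
    record SequencedEvent(long seq, String payload) {}

    private long lastSeen = -1;                         // highest sequence number applied so far
    private final List<String> applied = new ArrayList<>();

    // Applies an event once; replayed events from a resumed stream are skipped.
    boolean accept(SequencedEvent event) {
        if (event.seq() <= lastSeen) {
            return false;                               // duplicate from before the disconnect
        }
        lastSeen = event.seq();
        applied.add(event.payload());
        return true;
    }

    List<String> applied() {
        return List.copyOf(applied);
    }

    public static void main(String[] args) {
        ResumableConsumer consumer = new ResumableConsumer();
        consumer.accept(new SequencedEvent(0, "first"));
        consumer.accept(new SequencedEvent(1, "second"));
        // After a reconnect the server replays from seq 1; the duplicate is dropped.
        consumer.accept(new SequencedEvent(1, "second"));
        consumer.accept(new SequencedEvent(2, "third"));
        System.out.println(consumer.applied()); // [first, second, third]
    }
}
```

Without a sequence field, a client relying only on per-path document order cannot tell a replayed block from a genuinely repeated one.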

h3. Relationship to other work

Depends on TIKA-4766 (the {{Document}}/{{Block}} contract this 
streams). Relates to TIKA-4771 (external parsers; their streamed results ride 
this stream as {{extension}} events). Phase 2 will spawn a linked tika-pipes 
issue once there is agreement on direction.



