krickert opened a new pull request, #2921:
URL: https://github.com/apache/tika/pull/2921

   ## Summary
   
   Follow-up to #2916, reshaped per the review there. Instead of mirroring 
Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format 
messages), this PR types the thing that is actually stable: **the parsed 
document**. One small contract — `document.proto` is 208 lines — and format 
specifics live in per-parser mapping *code*, never in the wire.
   
   `FetchAndParseReply.fields` (`map<string,string>`, field 2, now reserved) is 
replaced by `FetchAndParseReply.document`.
   
   ## How this answers the #2916 review
   
   | Concern from #2916 | Where it landed |
   |---|---|
   | "11k lines to nail down maybe 80% of an open set" | 208-line contract; 
whole PR is ~3.9k lines, most of it mapper code + tests |
   | Clients rebuild when metadata definitions change | A metadata key is now 
*data* (`extra` tail), not schema — add/rename/retype a Tika `Property` and no 
client regenerates anything |
   | Lossless catch-all as source of truth | `Document.extra` carries every 
Tika key, multivalue-preserving; a test asserts nothing is dropped |
   | Special handling only for DC + core props | `DocumentMetadata` types only 
the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, 
dimensions, rights) |
   | Break it into individually reviewable tasks | This PR is the contract 
only; see "Deliberately not in this PR" |
   
   ## The shape
   
   1. **Content tree**: `markdown` (the same render `ToMarkdownContentHandler` 
already produces since TIKA-4730) plus `blocks` — that markdown parsed once, 
format-agnostically, into a structured tree of 
headings/paragraphs/lists/**tables**/code blocks/inline runs (CommonMark + GFM, 
a spec that does not churn). This is what a downstream NLP/RAG/embeddings 
consumer actually wants: typed tables and sections, not a string to re-parse.
   2. **Typed common metadata**: `DocumentMetadata`, grouped by concern, not by 
source format. Dates are `Timestamp`s, counts are ints — not strings that 12 
language clients each re-parse.
   3. **Tagged tail**: `extra` — every remaining key, typed only where Tika's 
own `Property` declares a type (integer/real/boolean/date), string otherwise, 
never guessed.
   4. **`embedded`** recurses: a PDF with an embedded image is a parent 
`Document` with a fully typed child — no forcing two formats into one bucket 
(this was the oneof problem from #2916).
   5. Adding a format = adding a `DocumentTransformer` (see 
`tika-grpc-mapper/docs/EXTENSIONS.md`); `PdfDocumentTransformer` is 65 lines 
and the wire contract does not move.
   
   ## Deliberately not in this PR (follow-ups, each its own PR)
   
   - **Pluggable external parsers**: registering a third-party gRPC service 
whose output rides along on the `Document` as a `google.protobuf.Any` — so 
wildly different result shapes (e.g. a document-layout model's tree) never 
require Tika to model them. Built and tested on a branch; kept out to keep this 
reviewable.
   - **A Markdown parser** for `.md` input files (separate JIRA).
   - Richer typed fields, if and only if real cross-format demand appears — 
they'd be additive `optional` fields, compatible both directions.
   
   ## Open decisions where reviewer preference wins
   
   1. **Tail shape**: `repeated MetadataField` with a typed value oneof (as 
implemented) vs the `map<string, StringList>` suggested in #2916. The 
typed-where-declared tail preserves types without guessing; the map is 
maximally churn-proof. Swapping is a one-message change — happy to go either 
way.
   2. **`markdown` + `blocks` both**: today both ship (string render + 
structured tree). If payload size matters, a per-request flag choosing one is 
easy.
   3. **Hard removal vs staged**: `fields` is hard-removed (4.0, nothing 
consumes it yet); can switch to deprecate-then-remove if preferred.
   
   ## Client migration
   
   | Before | After |
   |--------|-------|
   | `fields["X-TIKA:content"]` | `document.markdown` (or walk 
`document.blocks`) |
   | `fields["Content-Type"]` | `document.content_type` |
   | Ad hoc title/author/date strings | `document.metadata.title` / `.authors` 
/ `.created` (`Timestamp`) |
   | Any other key | `document.extra` (typed by declared `Property` type, 
string otherwise) |
   
   ## Test plan
   
   - [x] `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test` — green 
(transformer tests against real parse fixtures per format, block-tree tests, 
`DocumentBuilder` envelope/status/embedded tests, server tests reading 
`FetchAndParseReply.document`)
   - [x] `tika-grpc-api` jar bundles 
`META-INF/org.apache.tika.grpc.v1.descriptors` (verified: contains 
`document.proto`)
   - [x] e2e `tika-grpc-e2e-test` compiles against the new API
   - [ ] CI
   
   Downstream context: this contract is what the OpenNLP gRPC work 
(OPENNLP-1833) will consume as input — Tika parse → typed document → 
NLP/embeddings without re-parsing strings.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to