[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092609#comment-18092609
 ] 

ASF GitHub Bot commented on TIKA-4766:
--------------------------------------

tballison commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4842589016

   I've only had a chance to look at this briefly. 
   
   The challenges as you know: a) under the hood, we're currently storing 
metadata keys as strings, b) metadata keys are an open set with 
user-customizable keys, c) there are a lot of keys.
   
   This was claude's review
   
   ```
   What's legitimate
   
     - Replacing a flat map<string,string> with a typed contract is a 
reasonable thing to want — string-keyed maps are a weak contract for gRPC 
clients.
     - The tika-grpc-api / tika-grpc-mapper / tika-grpc module split (clients 
depend on schema, not server) is good hygiene.
     - Unmapped keys aren't simply dropped: each format message has an 
additional_metadata Struct (e.g. pdf_metadata.proto:253) and the builders fill 
it via
     MetadataUtils.buildAdditionalMetadata(..., mappedFields). So it's not as 
lossy as the "replace the map" framing first suggests.
   
     What doesn't hold up
   
     1. Dead fields in the public wire contract. ParseResponse.raw_properties 
(field 14) and repeated MetadataEntry metadata (field 12) are declared in the 
proto with
     comments promising "anything not captured in typed metadata" — but the 
mapper populates neither (0 references in ParseResponseMapper.java). Shipping 
public proto
     fields that are always empty is a contract smell; a client will reasonably 
read raw_properties and get nothing.
   
     2. Four overlapping places a value can live. typed format field → 
per-format additional_metadata Struct → dublin_core → creative_commons, plus 
the two dead
     channels. There's no single predictable place to find a given property. 
That's harder to consume correctly than the map it replaces.
   
     3. It re-detects the format Tika already told you. 
DocumentTypeDetector.detect() (316 lines of heuristics over metadata) chooses 
the oneof branch — but Tika
     already set Content-Type. Re-deriving it heuristically is fragile and 
duplicative, and it's called twice (mapper lines 198 and 327).
   
     4. The oneof document_metadata fights Tika's data model. Tika metadata 
routinely spans namespaces (a PDF carrying XMP/DC, EXIF on embedded images, …). 
Forcing
     exactly one format bucket is precisely why DC and CC had to be pulled out 
of the oneof as special cases — the model is already breaking under its own 
weight, and
     every new cross-cutting namespace adds another special case.
   
     5. A hand-maintained parallel copy of Tika's entire metadata taxonomy. 
~5,000 lines of proto + ~4,000 lines of builders that must track 
org.apache.tika.metadata.*
     by hand forever. When core adds/renames a Property, the typed field 
silently desyncs and the value quietly drops into additional_metadata. That's a 
large
     permanent maintenance tax for marginal gain over a good catch-all.
   
     6. Fidelity loss. Catch-alls use google.protobuf.Struct, which is JSON-ish 
(numbers are doubles, no native String[]); Tika metadata is uniformly 
multivalued
     strings. And small tells confirm looseness: hardcoded keys 
"Keywords"/"Content-Length" instead of Property constants, and content.keywords 
is a scalar though Tika
     keywords are multivalue.
   
     7. Scope creep past "typed parse response." EpubStructureExtractor (483 
lines) and the detector live in the gRPC mapper — that's parser/detection logic 
migrating
     into the serialization layer, the wrong altitude. 
WARC/font/climate-forecast/database schemas widen the surface a lot for a first 
cut. There's also a
     commit-typed-parse-response.sh script checked in, which doesn't belong in 
the PR.
   
   My recommendation
   
     The shape I'd push toward is augment, not replace:
     - Keep one lossless, uniform, actually-populated channel as source of 
truth (a repeated key/value that preserves multivalue), then layer a small 
curated set of
     typed convenience fields (DC + a dozen genuinely common PDF/office fields) 
on top.
     - Trust Content-Type; drop the format oneof in favor of coexisting 
sub-messages (or just DC + raw).
     - Don't move EPUB structure extraction / type detection into the mapper.
     - Wire or delete the dead fields, and stage the break (add parse_response 
alongside fields, deprecate, then remove) rather than a 10.7k-line breaking 
mega-PR.
   
   ```
   
   
   5. gives me great concern from a maintenance perspective. I haven't looked 
carefully enough to know if this is real.




> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map 
> ({{{}map<string,string>{}}}) with a typed 
> {{org.apache.tika.grpc.v1.ParseResponse}} on 
> {{{}FetchAndParseReply.parse_response{}}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
> supported formats). Dublin Core fields are normalized on the response root. 
> Creative Commons / XMP rights metadata is exposed on a dedicated field when 
> present.
> This is a *breaking change* for gRPC clients that read string keys from 
> {{{}fields{}}}.
> h2. Motivation
>  * Clients today must parse hundreds of ad hoc string keys with no schema or 
> type safety.
>  * Format-specific metadata is easier to consume and evolve with protobuf + 
> bundled descriptors.
>  * Mapping logic is separated from the gRPC server so it can be tested and 
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{{}ai-pipestream/tika{}}}, branch 
> {{{}typed-parse-response-grpc{}}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{{}org.apache.tika.grpc.v1{}}}), Java 
> stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional 
> {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove 
> {{fields}}|
> *Out of scope (follow-up tickets):*
>  * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
>  * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
> (currently additional_struct MVP)
>  * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, 
> reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
>  # *Envelope:* {{{}parse_id{}}}, {{{}parsed_at{}}}, {{{}status{}}}, 
> {{{}content{}}}, optional {{embedded_docs}}
>  # *Dublin Core:* {{dublin_core}} (shared across formats)
>  # *Format metadata:* oneof ({{{}pdf{}}}, {{{}office{}}}, {{{}html{}}}, 
> {{{}image{}}}, {{{}email{}}}, {{{}media{}}}, {{{}rtf{}}}, {{{}database{}}}, 
> {{{}font{}}}, {{{}epub{}}}, {{{}warc{}}}, {{{}climate_forecast{}}}, 
> {{{}generic{}}})
>  # *Creative Commons:* {{creative_commons}} when XMP rights metadata is 
> present
> h2. Architecture
> {code:none}
> Client
>   -> TikaGrpcServerImpl (tika-grpc)
>        -> Tika Pipes / parsers -> Metadata + body
>        -> ParseResponseMapper (tika-grpc-mapper)
>             -> format builders -> ParseResponse (tika-grpc-api)
>   <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
> {{{}tika-grpc-mapper{}}}. Clients may depend on {{tika-grpc-api}} alone for 
> stubs/descriptors.
> h2. Implementation notes
>  * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
>  * {{creative_commons}} is field 25 outside the document oneof so it can 
> coexist with PDF/Office/etc.
>  * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
> post-processing (e.g. outlines) without coupling the core mapper to 
> PDFBox/HTML libraries
>  * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
> {{{}tika-grpc-mapper{}}})
> h2. Acceptance criteria
>  * ( ) {{FetchAndParseReply}} exposes {{{}parse_response{}}}; {{fields}} is 
> removed/reserved
>  * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
>  * ( ) PDF, Office, HTML, and at least one other format return the expected 
> typed oneof in integration tests
>  * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
>  * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
> clients
>  * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
>  * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
> {{{}hasHtml(){}}}, {{{}content.body{}}}, and representative typed fields
>  * Confirm no regression in fetcher CRUD and streaming RPCs
>  * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
> rights
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to