tballison commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4842589016
I've only had a chance to look at this briefly.
The challenges as you know: a) under the hood, we're currently storing
metadata keys as strings, b) metadata keys are an open set with
user-customizable keys, c) there are a lot of keys.
This was claude's review
```
What's legitimate
- Replacing a flat map<string,string> with a typed contract is a
reasonable thing to want — string-keyed maps are a weak contract for gRPC
clients.
- The tika-grpc-api / tika-grpc-mapper / tika-grpc module split (clients
depend on schema, not server) is good hygiene.
- Unmapped keys aren't simply dropped: each format message has an
additional_metadata Struct (e.g. pdf_metadata.proto:253) and the builders fill
it via
MetadataUtils.buildAdditionalMetadata(..., mappedFields). So it's not as
lossy as the "replace the map" framing first suggests.
What doesn't hold up
1. Dead fields in the public wire contract. ParseResponse.raw_properties
(field 14) and repeated MetadataEntry metadata (field 12) are declared in the
proto with
comments promising "anything not captured in typed metadata" — but the
mapper populates neither (0 references in ParseResponseMapper.java). Shipping
public proto
fields that are always empty is a contract smell; a client will reasonably
read raw_properties and get nothing.
2. Four overlapping places a value can live. typed format field →
per-format additional_metadata Struct → dublin_core → creative_commons, plus
the two dead
channels. There's no single predictable place to find a given property.
That's harder to consume correctly than the map it replaces.
3. It re-detects the format Tika already told you.
DocumentTypeDetector.detect() (316 lines of heuristics over metadata) chooses
the oneof branch — but Tika
already set Content-Type. Re-deriving it heuristically is fragile and
duplicative, and it's called twice (mapper lines 198 and 327).
4. The oneof document_metadata fights Tika's data model. Tika metadata
routinely spans namespaces (a PDF carrying XMP/DC, EXIF on embedded images, …).
Forcing
exactly one format bucket is precisely why DC and CC had to be pulled out
of the oneof as special cases — the model is already breaking under its own
weight, and
every new cross-cutting namespace adds another special case.
5. A hand-maintained parallel copy of Tika's entire metadata taxonomy.
~5,000 lines of proto + ~4,000 lines of builders that must track
org.apache.tika.metadata.*
by hand forever. When core adds/renames a Property, the typed field
silently desyncs and the value quietly drops into additional_metadata. That's a
large
permanent maintenance tax for marginal gain over a good catch-all.
6. Fidelity loss. Catch-alls use google.protobuf.Struct, which is JSON-ish
(numbers are doubles, no native String[]); Tika metadata is uniformly
multivalued
strings. And small tells confirm looseness: hardcoded keys
"Keywords"/"Content-Length" instead of Property constants, and content.keywords
is a scalar though Tika
keywords are multivalue.
7. Scope creep past "typed parse response." EpubStructureExtractor (483
lines) and the detector live in the gRPC mapper — that's parser/detection logic
migrating
into the serialization layer, the wrong altitude.
WARC/font/climate-forecast/database schemas widen the surface a lot for a first
cut. There's also a
commit-typed-parse-response.sh script checked in, which doesn't belong in
the PR.
My recommendation
The shape I'd push toward is augment, not replace:
- Keep one lossless, uniform, actually-populated channel as source of
truth (a repeated key/value that preserves multivalue), then layer a small
curated set of
typed convenience fields (DC + a dozen genuinely common PDF/office fields)
on top.
- Trust Content-Type; drop the format oneof in favor of coexisting
sub-messages (or just DC + raw).
- Don't move EPUB structure extraction / type detection into the mapper.
- Wire or delete the dead fields, and stage the break (add parse_response
alongside fields, deprecate, then remove) rather than a 10.7k-line breaking
mega-PR.
```
5. gives me great concern from a maintenance perspective. I haven't looked
carefully enough to know if this is real.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]