krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4844253570

   The Claude review was useful and I went through all 7 points. Pushed in 
`99883db0c`.
   
   Short version: the "augment, not replace" shape you recommended and where 
this landed are the same now. One lossless, populated channel as the source of 
truth (`ParseResponse.metadata`), a typed convenience layer on top, 
Content-Type trusted, and the oneof dropped for coexisting submessages.
   
   Below is a summary of the changes that line up with the concerns that 
yourself and Claude pointed out (summary provided by claude based on my initial 
write up) - 
   
   | #   | Concern                   | Status       | What changed              
                                                                                
                                                                        |
   | --- | ------------------------- | ------------ | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 |
   | 1   | Dead wire fields          | Fixed        | `metadata` (field 12) 
populated; `raw_properties` (14) reserved; empty Structs removed                
                                                                            |
   | 2   | Four overlapping channels | Down to one  | only 
`ParseResponse.metadata` is a catch-all now; the 13 per-format 
`additional_metadata`, `BaseFields.raw_metadata`, and 4 `GenericMetadata` 
Structs are gone (numbers reserved) |
   | 3   | Re-detects format, twice  | Fixed        | single `detect()`, reads 
`Content-Type` first, exposed as a typed `primary_format` enum                  
                                                                         |
   | 4   | oneof fights Tika's model | Fixed + test | oneof to coexisting 
`optional` peers; `ParseResponseCoexistenceTest`                                
                                                                              |
   | 5   | Maintaining the mirror    | Bounded      | desync no longer drops 
data, see below                                                                 
                                                                           |
   | 6   | Fidelity loss (Struct)    | Fixed        | Struct gone; 
`MetadataEntry` keeps multivalue and native types; Property constants           
                                                                                
     |
   | 7   | Scope creep               | Partial      | helper script removed; 
EPUB extractor I can split if you want                                          
                                                                           |
   
   Write up (provided by me, edited for grammar by claude)
   
   **1. Dead fields.** `metadata` is populated for every key now, 
`raw_properties` is reserved. `ParseResponseMetadataPopulationTest` asserts 
`metadata.names().length == response.getMetadataCount()`, so there's no 
always-empty public field anymore and it's guard-railed.
   
   **2. Overlapping channels.** One catch-all: `ParseResponse.metadata`. Every 
per-format `additional_metadata`, `BaseFields.raw_metadata` (that one was a 
full dump embedded in every message), and the four `GenericMetadata` Structs 
are removed, field numbers reserved. Precedence is written up in 
`tika-grpc-api/README.md`.
   
   **3. Detection.** One `detect()` call that leads with `Content-Type` 
(extension and CC are fallbacks only when MIME is missing), surfaced as 
`DocumentFormatCategory primary_format` so clients read an enum instead of 
re-deriving.
   
   **4. The oneof.** Gone, replaced with independent `optional` submessages 
(same field numbers, so it stays wire-compatible). They coexist, so DC and CC 
aren't special cases anymore. `ParseResponseCoexistenceTest` covers both a 
format submessage plus the CC overlay on one response, and a parent PDF with an 
embedded child typed `IMAGE` on its own `parsed_content`. That last bit is 
where "EXIF on embedded images" actually belongs, instead of forcing a second 
bucket onto the parent.
   
   **5. Maintenance.** This is the big one.  It's often the first "shock" 
people get when surfacing a gRPC interface - it's it's also why most 
implementations never take off.  I believe in keeping the ugly parsing details 
behind the contract and not pushing that work to the user.  That's the center 
of the decision - and there' s honestly no good compromise (avoid frameworks 
like bazel as it turns into just as much mapping scaffolding and you're at the 
whim of yet another layer - its best to own it and go first class.  LLMs are 
great at writing mapping code and will only get better). 
   
   Two things keep it bounded:
   
   - No silent data loss when a field desyncs. The mirror is lossless and the 
test above proves every Tika key lands in `metadata`, typed. So if core renames 
or adds a Property and a typed field stops matching, the value doesn't 
disappear into a lossy Struct, it's still in `metadata` with the right type. 
Worst case is a convenience field is briefly missing, not a dropped value. That 
kills the specific failure mode the review called out.
   - It's cheap. I lifted this model from an OSS pipeline I run live and it's 
been about an hour a year to keep current (thanks to LLMs this is now easy). 
The interfaces don't move much at all.  If they do, we still capture unknowns. 
(Also, I'll be glad to maintain this - as I need to maintain it anyway for 
other projects)
   
   And the typed layer is opt-in. If you want zero coupling to the taxonomy you 
read `metadata` and ignore the typed fields. The mirror is the contract, the 
typed fields are convenience. My take is still that if the contract stays 
dynamic, JSON is the better fit, the whole reason to be on gRPC is the typed 
schema.  Simply put - grpc is pretty bad at being dynamic.  So don't fight it, 
embrace the type safety and downstream client code looks great and all those 
parsing casting errors you've seen for years just go away.
   
   **6. Fidelity.** No more Struct, so no JSON-ish doubles and no lost 
String[]. `MetadataEntry` keeps multivalue via `text_list` and native types 
(int64/double/bool/Timestamp) coerced from Tika Property definitions. Swapped 
the hardcoded "Keywords"/"Content-Length" for Property constants, and 
multivalue keywords are joined.
   
   **7. Scope.** The `commit-typed-parse-response.sh` script is gone. The 
detector trusts Content-Type and sits in the mapper as routing, not parsing, 
but I can move it. `EpubStructureExtractor` is a fair hit. I can split it, and 
trim WARC/font/climate to a thinner first cut, into a follow-up if you'd rather 
keep this PR lean.
   
   On staging the break: nothing consumes the gRPC `fields` map yet and this is 
4.0, so I did a hard removal instead of deprecate-then-remove. Easy to switch 
to staged if you'd prefer not to break it in one shot.  But again, the fields 
map is not really the design you want - the JSON interface is far better for 
this.
   
   Why I care about the typing: it lets me wire tika into the opennlp-sandbox 
gRPC work (OPENNLP-1833) and on to embeddings without re-parsing metadata Tika 
already typed.  I also created mapping tools that I'm going to share that make 
the mappings between protobufs via CEL selectors - making mapping truely 
dynamic while not resorting to java reflection.
   
   I know this is a lot - and I'll keep up with responding to any concerns.  
   
   Where I'd want your read: how curated the typed surface should be (you 
floated DC plus a dozen common fields), the EPUB-extractor and detector 
altitude, and hard-remove vs staged. Which of those matter most to you?  
   
   I think that the popularity of this resides on being able to get other 
languages like Rust and Python to play well with the interface.  So a strongly 
typed one - where java can feel like a pain - is actually strongly preferred in 
a python IDE and would make this package very attractive as a first class 
parser that exceeds speed and capabilities of any parser out there.
   
   Tika is great because it's a powerhouse for speed.  It's reliable.  This 
would polish up the surface and give that to 12 languages in a solid contract.  
So to me the contract is most important and why this reply focuses so heavily 
on it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to