krickert commented on PR #2916: URL: https://github.com/apache/tika/pull/2916#issuecomment-4848745681
Tim, you're right - I'll make a new proto solution, and we can table this one. It's too complex. My use case: I want to take Tika's output and use it as input for OpenNLP in a typesafe way. But mirroring an open metadata taxonomy in protos is the wrong thing to sign the project up for. I'd maintain it, but that's not scalable.

So let me drop that framing and come at it from the other side. The thing worth typing isn't Tika's metadata - it's the parsed document. Here's the shape I'd propose, and it's one small stable proto, not a per-format taxonomy:

- **Content as structured blocks** - headings, paragraphs, lists, tables, code, images. It's a standard markdown document model, so it renders straight back to markdown and it's exactly what a RAG/embeddings pipeline wants to consume. This is the actual product, and it's anchored to a spec that doesn't churn.
- **Common metadata typed** - title, authors, created/modified as `Timestamp`, page/word counts, language. The cross-format stuff everyone always wants, and where a date has to be a `Timestamp`, not a string that 12 languages each re-parse.
- **Everything else in one native tagged tail** - typed where Tika already declares the type, string otherwise (never guessed). That's the lossless map that replaces the old `fields` map.

This is actually close to where you landed - the tail is your `map`, just multivalue and type-aware, and the typed surface is the common cross-format fields, a bit past Dublin Core but nowhere near a taxonomy mirror.

On the maintenance worry, which is the real one: format specifics don't go on the wire. They go in a per-parser transformer (just code). One `Document` proto. Adding a parser is adding a transformer - the contract doesn't move, so clients never rebuild for it.

And to be precise about the rebuild fear: in proto3, adding `optional` fields is backward and forward compatible. Existing clients keep working and simply don't see the new field.
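
To make the shape concrete, here is a rough sketch of the three parts (blocks, typed common metadata, tagged tail) plus a generic transformer. It's in Python only to keep it short; the real contract would be a proto. Every name here (`Block`, `Document`, `DECLARED_TYPES`, `transform`, and the `example:*` keys) is made up for illustration and is not an existing Tika API. The only things assumed about Tika are that raw metadata arrives as multivalued strings and that some keys carry a declared type.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Union

TailValue = Union[str, int, datetime]


@dataclass
class Block:
    kind: str        # "heading", "paragraph", "list", "table", "code", "image"
    text: str = ""
    level: int = 0   # heading level; 0 for every other kind


@dataclass
class Document:
    # Content as structured blocks.
    blocks: List[Block] = field(default_factory=list)
    # Common metadata, typed.
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    created: Optional[datetime] = None
    # Everything else: multivalue, type-aware tail.
    tail: Dict[str, List[TailValue]] = field(default_factory=dict)


# Hypothetical stand-in for "Tika already declares the type".
# Keys not listed here stay strings; nothing is guessed.
DECLARED_TYPES = {"example:revision": int, "example:last-printed": datetime}


def parse_timestamp(value: str) -> datetime:
    # Accept a trailing "Z" as UTC so older Python versions can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def convert(key: str, value: str) -> TailValue:
    declared = DECLARED_TYPES.get(key)
    if declared is int:
        return int(value)
    if declared is datetime:
        return parse_timestamp(value)
    return value


def transform(raw: Dict[str, List[str]], blocks: List[Block]) -> Document:
    """Generic transformer: promote common keys, keep the rest in the tail."""
    doc = Document(blocks=list(blocks))
    for key, values in raw.items():
        if not values:
            continue
        if key == "dc:title":
            doc.title = values[0]
        elif key == "dc:creator":
            doc.authors = list(values)
        elif key == "dcterms:created":
            doc.created = parse_timestamp(values[0])
        else:
            doc.tail[key] = [convert(key, v) for v in values]
    return doc
```

A per-parser transformer would have the same signature and just know more keys, so `Document` itself never changes when a parser is added.
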
Nobody is forced to regenerate unless they actually want the new data. So our metadata churn lands in the mapper and the tail, never in a contract clients have to chase.

To answer your question directly - what I need in Tika vs outside:

- **In Tika:** the `Document` proto, a generic transformer, and the tagged tail replacing the `fields` map. Small and stable.
- **Outside / pluggable:** the richer per-parser transformers can ship as add-on modules. Tika owns a clean contract; the heavy mapping is opt-in.

On why bother typing it at all, since I know that's the undercurrent: the whole point of gRPC is that the message *is* the typed object. If the client still has to crawl and re-parse strings, then gRPC is only doing the serde and we've handed the work back to the user. Protobuf gives you clean JSON for free on top of that, and going the other way never gives you a typed contract. So this isn't type-safety for its own sake - it's what lets Tika be a first-class parser from Rust, Python and Go, not just Java, with one contract across all of them.

I'll redo this - give me a day to reshape it. It'll be far fewer fields to maintain, and a transformation interface will exist. If it doesn't, we'll put it in a struct.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
