[ https://issues.apache.org/jira/browse/TIKA-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063220#comment-18063220 ]
Tim Allison commented on TIKA-4680:
-----------------------------------
I updated the unpack endpoint on tika-server in 4.x to use tika-pipes for full
recursive extraction.
I think this will still work well in gRPC.
I'm thinking about switching the default format for unpack to
"frictionless"; that shouldn't interfere with this work. Any objections?
> tika-grpc: Add unpack/all support for extracting embedded documents
> -------------------------------------------------------------------
>
> Key: TIKA-4680
> URL: https://issues.apache.org/jira/browse/TIKA-4680
> Project: Tika
> Issue Type: Improvement
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> h2. Summary
> The tika-grpc server currently only supports FetchAndParse, which returns
> parsed text and metadata for a single document. There is no equivalent of the
> REST server's {code}PUT /unpack/all{code} endpoint, which uses
> RecursiveParserWrapper to extract embedded documents (attachments, slides,
> worksheets) from container formats like EML, PPTX, ZIP, DOCX, etc.
> This was requested by Lawrence Moorehead (elemdisc) in the context of
> TIKA-4679 (HTTP/2 support).
> h2. Proposed Design
> Add a new server-side streaming RPC to the tika-grpc service:
> {code:proto}
> rpc Unpack(FetchAndParseRequest) returns (stream UnpackReply) {}
> message UnpackReply {
>   string fetch_key = 1;
>   // e.g. attachment0.pdf or word/document.xml
>   string embedded_resource_path = 2;
>   bytes content = 3;                 // raw bytes of embedded doc
>   map<string, string> metadata = 4;  // Tika metadata for this embedded doc
>   string status = 5;
>   string error_message = 6;
> }
> {code}
> * Server implementation uses RecursiveParserWrapper with a
> ContentHandlerFactory that captures each embedded document's bytes
> * Each embedded document (plus the container itself) is streamed as a
> separate UnpackReply message
> * Aligns with REST /unpack/all semantics
> h2. References
> * REST UnpackerResource:
> tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/UnpackerResource.java
> * TIKA-4679: HTTP/2 support (sibling ticket; Lawrence's use case)
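
For illustration only, here is a minimal JDK-only sketch of the streaming behavior described in the quoted design: one reply for the container, then one reply per embedded document. It is not the tika-grpc implementation. The class UnpackSketch, the plain-Java UnpackReply holder, the "size" metadata key, the "OK"/"ERROR" status strings, and the container-first ordering are all assumptions, and java.util.zip stands in for Tika's recursive parsing so the example runs without Tika or gRPC on the classpath. A Consumer plays the role a gRPC response observer would play.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnpackSketch {

    /** Hypothetical plain-Java mirror of the proposed UnpackReply message. */
    public static final class UnpackReply {
        public final String fetchKey;
        public final String embeddedResourcePath;
        public final byte[] content;
        public final Map<String, String> metadata;
        public final String status;
        public final String errorMessage;

        public UnpackReply(String fetchKey, String embeddedResourcePath, byte[] content,
                           Map<String, String> metadata, String status, String errorMessage) {
            this.fetchKey = fetchKey;
            this.embeddedResourcePath = embeddedResourcePath;
            this.content = content;
            this.metadata = metadata;
            this.status = status;
            this.errorMessage = errorMessage;
        }
    }

    /**
     * Emits one reply for the container itself, then one reply per embedded entry.
     * ZIP reading is a stand-in for real recursive parsing (assumption).
     */
    public static void unpack(String fetchKey, byte[] container, Consumer<UnpackReply> observer) {
        observer.accept(new UnpackReply(fetchKey, "", container,
                Map.of("size", String.valueOf(container.length)), "OK", ""));
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(container))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no bytes of their own
                }
                byte[] bytes = zip.readAllBytes(); // reads to the end of the current entry
                observer.accept(new UnpackReply(fetchKey, entry.getName(), bytes,
                        Map.of("size", String.valueOf(bytes.length)), "OK", ""));
            }
        } catch (IOException e) {
            // Report the failure as a reply rather than aborting silently.
            observer.accept(new UnpackReply(fetchKey, "", new byte[0], Map.of(),
                    "ERROR", String.valueOf(e.getMessage())));
        }
    }

    /** Builds a small in-memory ZIP with two entries to act as the container. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("attachment0.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<UnpackReply> replies = new ArrayList<>();
        unpack("sample.zip", sampleZip(), replies::add);
        for (UnpackReply r : replies) {
            System.out.println(r.status + " '" + r.embeddedResourcePath + "' "
                    + r.content.length + " bytes");
        }
    }
}
```

Reporting a failure as a reply with an error status, rather than ending the stream abruptly, is one way the status and error_message fields could be used; the issue does not specify this.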