[ 
https://issues.apache.org/jira/browse/TIKA-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063220#comment-18063220
 ] 

Tim Allison commented on TIKA-4680:
-----------------------------------

I updated the unpack endpoint on tika-server in 4.x to use tika-pipes for full 
recursive extraction. 

I think this will still work well in grpc.

I'm thinking about switching the default format for unpack to 
"frictionless". That shouldn't interfere with this work. Any objections?

> tika-grpc: Add unpack/all support for extracting embedded documents
> -------------------------------------------------------------------
>
>                 Key: TIKA-4680
>                 URL: https://issues.apache.org/jira/browse/TIKA-4680
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-pipes
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> h2. Summary
> The tika-grpc server currently only supports FetchAndParse, which returns 
> parsed text and metadata for a single document. There is no equivalent of the 
> REST server's {{PUT /unpack/all}} endpoint, which uses 
> RecursiveParserWrapper to extract embedded documents (attachments, slides, 
> worksheets) from container formats like EML, PPTX, ZIP, DOCX, etc.
> This was requested by Lawrence Moorehead (elemdisc) in the context of 
> TIKA-4679 (HTTP/2 support).
> h2. Proposed Design
> Add a new server-side streaming RPC to the tika-grpc service:
> {code:proto}
> rpc Unpack(FetchAndParseRequest) returns (stream UnpackReply) {}
> message UnpackReply {
>   string fetch_key = 1;
>   string embedded_resource_path = 2;    // e.g. attachment0.pdf or word/document.xml
>   bytes  content = 3;                    // raw bytes of embedded doc
>   map<string, string> metadata = 4;      // Tika metadata for this embedded doc
>   string status = 5;
>   string error_message = 6;
> }
> {code}
> * Server implementation uses RecursiveParserWrapper with a 
> ContentHandlerFactory that captures each embedded document's bytes
> * Each embedded document (plus the container itself) is streamed as a 
> separate UnpackReply message
> * Aligns with REST /unpack/all semantics
> h2. References
> * REST UnpackerResource: 
> tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/UnpackerResource.java
> * TIKA-4679: HTTP/2 support (sibling ticket; Lawrence's use case)
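
For illustration only, the streaming shape in the proposed design (one reply for the container, then one reply per embedded document) could be sketched roughly as below. This is not the tika-grpc implementation: it uses only the JDK, swaps RecursiveParserWrapper for a plain java.util.zip walk over a ZIP container, and replaces the gRPC stream with a Consumer callback. The class UnpackSketch, the UnpackReply record, the "length" metadata key, and the "OK"/"ERROR" status strings are all hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class UnpackSketch {

    /** Hypothetical stand-in for the proposed UnpackReply message (same six fields). */
    record UnpackReply(String fetchKey, String embeddedResourcePath, byte[] content,
                       Map<String, String> metadata, String status, String errorMessage) {}

    /**
     * Emits one reply for the container itself, then one per embedded entry.
     * The Consumer plays the role a gRPC response stream would play on the server.
     * Assumption: the container is a ZIP; real code would delegate to Tika's parsers.
     */
    static void unpack(String fetchKey, byte[] container, Consumer<UnpackReply> out) {
        // "length" is a made-up metadata key, not a Tika metadata name.
        out.accept(new UnpackReply(fetchKey, "", container,
                Map.of("length", Integer.toString(container.length)), "OK", ""));
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(container))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no bytes of their own
                }
                byte[] bytes = zip.readAllBytes(); // reads to the end of the current entry
                out.accept(new UnpackReply(fetchKey, entry.getName(), bytes,
                        Map.of("length", Integer.toString(bytes.length)), "OK", ""));
            }
        } catch (IOException e) {
            // Report the failure as a reply instead of aborting the whole stream.
            out.accept(new UnpackReply(fetchKey, "", new byte[0], Map.of(),
                    "ERROR", String.valueOf(e.getMessage())));
        }
    }

    /** Builds a small in-memory ZIP so the example needs no files on disk. */
    static byte[] sampleZip() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("attachment0.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        List<UnpackReply> replies = new ArrayList<>();
        unpack("my-fetch-key", sampleZip(), replies::add);
        for (UnpackReply r : replies) {
            System.out.println(r.status() + " [" + r.embeddedResourcePath() + "] "
                    + r.content().length + " bytes");
        }
    }
}
```

In a real server-side streaming RPC, each out.accept(...) call would presumably become one message written to the response stream, so a client can start processing the first embedded documents before the whole container has been walked.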



--
This message was sent by Atlassian Jira
(v8.20.10#820010)