[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845302#comment-17845302
 ] 

ASF GitHub Bot commented on TIKA-4252:
--------------------------------------

tballison commented on code in PR #1753:
URL: https://github.com/apache/tika/pull/1753#discussion_r1596634451


##########
tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java:
##########
@@ -455,33 +455,33 @@ private Fetcher getFetcher(FetchEmitTuple t) {
         }
     }
 
-    protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple t, Fetcher fetcher) {
-        FetchKey fetchKey = t.getFetchKey();
+    protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple fetchEmitTuple, Fetcher fetcher) {
+        FetchKey fetchKey = fetchEmitTuple.getFetchKey();
+        Metadata fetchResponseMetadata = new Metadata();

Review Comment:
   The metadata on the FetchEmitTuple was envisioned as user-injected 
metadata that is added after the parse and then emitted (e.g. provenance 
metadata).
   
   I think we need to put both metadata objects on the FetchEmitTuple.
   
   This is what I'm thinking...let me know what you think.
   
   So, there will be three metadata objects in play. The FetchEmitTuple will 
have a fetchRequestMetadata (???) and a userMetadata (???). At parse time, 
we'll create a fresh Metadata object, which we'll call "responseMetadata", and 
pass it in the following call: fetcher.fetch(requestMetadata, responseMetadata).
   
   The parse will then use the responseMetadata and, after the parse, inject 
the userMetadata from the fetchEmitTuple.
   
   The fetcher may use the fetchRequestMetadata to carry out its request, but 
nothing from it should end up in the "responseMetadata" or in the emitted 
data.
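
   The flow described above could be sketched roughly as follows. This is a 
minimal, self-contained illustration, not the actual Tika code: SimpleMetadata, 
SketchTuple, and SketchFetcher are hypothetical stand-ins for Metadata, 
FetchEmitTuple, and Fetcher, the extra fetchKey argument to fetch() is an 
assumption, and the "sketch:" keys are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ThreeMetadataSketch {

    /** Hypothetical, simplified stand-in for Tika's Metadata class. */
    static class SimpleMetadata {
        final Map<String, String> values = new LinkedHashMap<>();

        void set(String name, String value) {
            values.put(name, value);
        }

        String get(String name) {
            return values.get(name);
        }
    }

    /** Hypothetical tuple carrying the two metadata objects from the proposal. */
    static class SketchTuple {
        final String fetchKey;
        final SimpleMetadata fetchRequestMetadata; // used only to perform the fetch
        final SimpleMetadata userMetadata;         // injected after the parse, then emitted

        SketchTuple(String fetchKey, SimpleMetadata fetchRequestMetadata,
                    SimpleMetadata userMetadata) {
            this.fetchKey = fetchKey;
            this.fetchRequestMetadata = fetchRequestMetadata;
            this.userMetadata = userMetadata;
        }
    }

    /** Hypothetical fetcher: reads requestMetadata, writes only to responseMetadata. */
    interface SketchFetcher {
        byte[] fetch(String fetchKey, SimpleMetadata requestMetadata,
                     SimpleMetadata responseMetadata);
    }

    static SimpleMetadata parseFromTuple(SketchTuple t, SketchFetcher fetcher) {
        // 1. A fresh object per parse; request metadata is never copied into it.
        SimpleMetadata responseMetadata = new SimpleMetadata();
        byte[] bytes = fetcher.fetch(t.fetchKey, t.fetchRequestMetadata, responseMetadata);

        // 2. Placeholder for the real parse, which would populate responseMetadata.
        responseMetadata.set("sketch:length", Integer.toString(bytes.length));

        // 3. Only after the parse, inject the user metadata that should be emitted.
        t.userMetadata.values.forEach(responseMetadata::set);
        return responseMetadata;
    }

    public static void main(String[] args) {
        SimpleMetadata request = new SimpleMetadata();
        request.set("sketch:auth-token", "secret");
        SimpleMetadata user = new SimpleMetadata();
        user.set("sketch:provenance", "crawl-42");

        SketchFetcher fetcher = (key, req, resp) -> {
            // The fetcher may consult the request metadata to do its work.
            resp.set("sketch:authenticated",
                    String.valueOf(req.get("sketch:auth-token") != null));
            return "hello".getBytes(StandardCharsets.UTF_8);
        };

        SimpleMetadata emitted =
                parseFromTuple(new SketchTuple("doc-1", request, user), fetcher);
        System.out.println(emitted.values);
    }
}
```

   Keeping the request metadata as a separate argument, rather than merging it 
into a single object, is what keeps things like credentials out of the emitted 
data in this sketch.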





> PipesClient#process - seems to lose the Fetch input metadata?
> -------------------------------------------------------------
>
>                 Key: TIKA-4252
>                 URL: https://issues.apache.org/jira/browse/TIKA-4252
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos =
>                     UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> I verified that the bytes contain the expected metadata at that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> always creates a new Metadata, when it should use an empty Metadata only if 
> the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
