[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093461#comment-17093461
 ] 

Tim Allison commented on TIKA-3093:
---

I wonder if it would be simpler if we offered four forwarding options: Solr, 
Elasticsearch, FileSystem and custom.  We could load the "custom" from 
SPI...users could drop their jar in the tika-server.jar directory.

Under this proposal, we would not use the Solr/ES clients, we'd do our own 
mapping, which should be fairly straightforward.
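
To make "our own mapping" concrete, here is a minimal Python sketch of building the forwarding request without the Solr/ES client libraries. It is an illustration only, not part of the proposal: the function name `build_forward_request`, the target names, and the example core/index names are hypothetical. It assumes the Solr base URL already points at a core or collection, and that Elasticsearch documents are addressed as `/<index>/_doc/<_id>`.

```python
import json
from urllib.parse import quote


def build_forward_request(target, base_url, doc, index=None, doc_id=None):
    """Return (http_method, url, json_body) for a forwarding target.

    Hypothetical helper: only plain HTTP and JSON are used, no client libraries.
    """
    base = base_url.rstrip("/")
    body = json.dumps(doc)
    if target == "solr":
        # Solr's JSON document handler accepts arbitrary JSON docs via POST.
        return "POST", f"{base}/update/json/docs", body
    if target == "elasticsearch":
        if not index or not doc_id:
            raise ValueError("elasticsearch forwarding needs an index and a doc_id")
        # Percent-encode path segments so ids with spaces or slashes stay intact.
        url = f"{base}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
        return "PUT", url, body
    raise ValueError(f"unsupported target: {target!r}")


if __name__ == "__main__":
    print(build_forward_request(
        "solr", "http://localhost:8983/solr/mycore", {"mime": "text/plain"}))
    print(build_forward_request(
        "elasticsearch", "http://localhost:9200", {"mime": "text/plain"},
        index="docs", doc_id="doc1"))
```

A FileSystem or custom target would slot in as further branches (or as pluggable callables); that part is left out because the thread does not specify it.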

I'm hesitant to add implementation/tool-specific forwarding options (e.g. Solr 
and Elasticsearch), but I don't want to have everyone rolling their own.  The 
problem here, obviously, will be keeping up with different versions of Solr/ES.  
My sense is that the APIs for adding docs haven't changed much in these two 
projects.

There are several things that I don't like about this, and I'm open to -1 and 
better options.  I'd rather not be stuck here: https://xkcd.com/974/


> Enable tika-server to forward parse results to another endpoint
> ---
>
> Key: TIKA-3093
> URL: https://issues.apache.org/jira/browse/TIKA-3093
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: test_recursive_embedded.docx.json
>
>
> bq. I see the "send the results to a remote network service" thing as 
> probably being separate from the Content Handler.
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to 
> another endpoint.  For example, a user could specify a Solr 
> URL/update/json/docs handler or an elastic //_doc/<_id>
> We may want to allow users to do custom mapping before redirecting to another 
> URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091926#comment-17091926
 ] 

Lewis John McGibbney commented on TIKA-3093:


[~tallison]

bq. ...will converting tika-server to OpenAPI (TIKA-3082) take away the need 
for this?

No, it just provides improved (standard) formalization of the REST interfaces.

Thanks for tagging me here; maybe I can take this one on after I come 
back to TIKA-3082.



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091797#comment-17091797
 ] 

Tim Allison commented on TIKA-3093:
---

The big remaining question is how we allow users to map from 

{noformat}
[
  { ...main doc },
  { embedded doc1},
  { embedded doc2}
]
{noformat}

to, say:
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-JSONExamples
{noformat}
[
  {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "2",
        "comments": "SolrCloud supports it too!"
      }
    ]
  },
{noformat}
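
One possible shape for that mapping, sketched in Python purely for illustration. The function name `to_solr_block`, the `id` scheme (`<prefix>-<n>`), and the choice to attach every embedded document directly under the main document (ignoring deeper nesting) are assumptions, not something settled in this issue.

```python
def to_solr_block(rmeta_docs, id_prefix="doc"):
    """Turn a flat /rmeta-style list into one Solr parent with child documents.

    The first entry is treated as the main document; all remaining entries
    become its children. Nesting among embedded documents is not preserved.
    """
    if not rmeta_docs:
        raise ValueError("expected at least the main document")
    main, *embedded = rmeta_docs

    parent = dict(main)  # copy so the caller's data is left untouched
    parent["id"] = id_prefix
    parent["content_type"] = "parentDocument"

    children = []
    for n, child in enumerate(embedded, start=1):
        doc = dict(child)
        doc["id"] = f"{id_prefix}-{n}"  # hypothetical id scheme
        children.append(doc)
    if children:
        parent["_childDocuments_"] = children
    return [parent]


if __name__ == "__main__":
    import json
    rmeta = [
        {"title": "main doc"},
        {"title": "embedded doc1"},
        {"title": "embedded doc2"},
    ]
    print(json.dumps(to_solr_block(rmeta, id_prefix="1"), indent=2))
```

Whether users would want this flattening, or a mapping that keeps the full embedded hierarchy, is exactly the open question.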



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091774#comment-17091774
 ] 

Tim Allison commented on TIKA-3093:
---

A strawman proposal...

This relies on the /rmeta style output, e.g. 
[^test_recursive_embedded.docx.json].

Users could specify mappings in a forward-config.json file, read at server 
startup, like so:

{noformat}
{
  "url": "http://localhost:8983/solr",
  "method": "(put|post)",
  "onException": "(skip|continue)",
  "fields": {
    "include_non_mapped": false,
    "mappings": {
      "Content-Type": "mime",
      "X-TIKA:content": "content"
    }
  }
}
{noformat}

They'd put their bytes to http://localhost:9998/tika_forward.  In the HTTP 
headers, they could include fields to inject, e.g. -H "field: id ; doc1" -H 
"field: myfield ; something_special".

If there's a parse exception and "onException" is "continue", then the 
stacktrace would be stored in the /rmeta output, and the document would be 
forwarded.  If set to "skip", the handler would throw an exception back to the 
client.
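
As a rough illustration of how the "fields" section and the injected header fields might be applied to one metadata object, here is a Python sketch. The helper names `parse_field_headers` and `map_document` are hypothetical, and the rule that injected fields override mapped ones is an assumption; the strawman does not say which should win.

```python
def parse_field_headers(header_values):
    """Parse 'name ; value' strings, as in -H "field: id ; doc1"."""
    fields = {}
    for raw in header_values:
        name, sep, value = raw.partition(";")
        if not sep:
            raise ValueError(f"expected 'name ; value', got {raw!r}")
        fields[name.strip()] = value.strip()
    return fields


def map_document(metadata, config, injected=None):
    """Rename metadata keys per config["fields"]["mappings"].

    Unmapped keys are kept only when "include_non_mapped" is true.
    Injected fields are applied last, so they win on key clashes (assumption).
    """
    fields_cfg = config["fields"]
    mappings = fields_cfg.get("mappings", {})
    keep_unmapped = fields_cfg.get("include_non_mapped", False)

    out = {}
    for key, value in metadata.items():
        if key in mappings:
            out[mappings[key]] = value
        elif keep_unmapped:
            out[key] = value
    out.update(injected or {})
    return out


if __name__ == "__main__":
    config = {
        "fields": {
            "include_non_mapped": False,
            "mappings": {"Content-Type": "mime", "X-TIKA:content": "content"},
        }
    }
    doc = {"Content-Type": "application/pdf", "X-TIKA:content": "hello", "Author": "someone"}
    injected = parse_field_headers(["id ; doc1", "myfield ; something_special"])
    print(map_document(doc, config, injected))
```

With "include_non_mapped" set to false, the "Author" key in the example is dropped; setting it to true would pass it through under its original name.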



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091709#comment-17091709
 ] 

Tim Allison commented on TIKA-3093:
---

[~lewismc] will converting tika-server to OpenAPI (right term?) take away the 
need for this?



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091708#comment-17091708
 ] 

Chris Mattmann commented on TIKA-3093:
--

Yeah, we have lots of pipelines with OODT and Tika that do this already 
(http://github.com/apache/drat/ is a classic example of this)...



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091703#comment-17091703
 ] 

David Eric Pugh commented on TIKA-3093:
---

Out of curiosity, is this type of behavior, the "Let me chain a set of 
interactions together", something that already exists?

Imagine System A is a CMS, System B is Tika, and System C is Solr...

What if I wanted to do something like "Send a request for a doc by id to System 
A, have it dig up the doc by id in System A, then forward to System B for 
extraction, and then forward to System C for storage"?

Is there an already existing pattern for this that Tika could conform to?  It 
feels like a pipe of some kind...


