[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093461#comment-17093461 ]

Tim Allison commented on TIKA-3093:
---

I wonder if it would be simpler if we offered four forwarding options: Solr, Elasticsearch, FileSystem and custom. We could load the "custom" option via SPI; users could drop their jar in the tika-server.jar directory. Under this proposal, we would not use the Solr/ES clients; we'd do our own mapping, which should be fairly straightforward.

I'm hesitant to add implementation/tool-specific forwarding options (e.g. Solr and Elasticsearch), but I don't want to have everyone rolling their own. The problem here, obviously, will be keeping up with different versions of Solr/ES. My sense is that the APIs for adding docs haven't changed much in these two projects.

There are several things that I don't like about this, and I'm open to -1 and better options. I'd rather not be stuck here: https://xkcd.com/974/

> Enable tika-server to forward parse results to another endpoint
> ---
>
> Key: TIKA-3093
> URL: https://issues.apache.org/jira/browse/TIKA-3093
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: test_recursive_embedded.docx.json
>
> bq. I see the "send the results to a remote network service" thing as probably being separate from the Content Handler.
>
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to another endpoint. For example, a user could specify a Solr URL/update/json/docs handler or an elastic //_doc/<_id>
> We may want to allow users to do custom mapping before redirecting to another URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].

-- This message was sent by Atlassian Jira (v8.3.4#803005)
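To make the "do our own mapping without the Solr/ES clients" idea in the comment above concrete, here is a minimal Python sketch that only builds the HTTP requests and does not send them. The function names, the FORWARDERS table, and the collection/index arguments are hypothetical; the endpoint shapes (Solr's /update/json/docs handler and Elasticsearch's /_doc/<_id>) follow the issue description. The FileSystem and custom (SPI) options are not covered.

```python
import json
import urllib.request
from urllib.parse import quote

JSON_HEADERS = {"Content-Type": "application/json"}


def solr_request(base_url, collection, doc):
    """Build (but do not send) a POST to Solr's /update/json/docs handler."""
    url = f"{base_url.rstrip('/')}/{quote(collection)}/update/json/docs"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="POST")


def elasticsearch_request(base_url, index, doc_id, doc):
    """Build (but do not send) a PUT to Elasticsearch's /<index>/_doc/<_id>."""
    # safe="" so that a "/" inside the id cannot change the URL path.
    url = f"{base_url.rstrip('/')}/{quote(index)}/_doc/{quote(doc_id, safe='')}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="PUT")


# Hypothetical lookup table standing in for the built-in forwarding options.
FORWARDERS = {
    "solr": solr_request,
    "elasticsearch": elasticsearch_request,
}
```

A caller would pass the returned request to urllib.request.urlopen. Keeping request construction separate from sending makes the per-target differences (URL shape and HTTP method) easy to test without a running Solr or Elasticsearch.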
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091926#comment-17091926 ]

Lewis John McGibbney commented on TIKA-3093:
---

[~tallison]

bq. ...will converting tika-server to OpenAPI (TIKA-3082) take away the need for this?

No, it just provides an improved (standard) formalization of the REST interfaces. Thanks for tagging me here; maybe I can take this one on after I come back to TIKA-3082.
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091797#comment-17091797 ]

Tim Allison commented on TIKA-3093:
---

The big remaining question is how we allow users to map from

{noformat}
[
  { ...main doc },
  { embedded doc1},
  { embedded doc2}
]
{noformat}

to, say, the nested-document form shown at https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-JSONExamples

{noformat}
[
  {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "2",
        "comments": "SolrCloud supports it too!"
      }
    ]
  },
{noformat}
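One possible answer to the mapping question above, sketched in Python: treat the first /rmeta entry as the parent and hang every embedded document under "_childDocuments_". The function name and the id scheme (parent id plus a counter) are assumptions made for illustration, not part of the proposal, and the sketch treats every embedded document as a direct child, ignoring any deeper nesting.

```python
def to_solr_block(rmeta_docs, parent_id):
    """Nest embedded docs under the container doc as Solr child documents.

    rmeta_docs: list of metadata dicts, container document first.
    parent_id:  id to assign to the container document (assumed to be
                supplied by the caller).
    """
    if not rmeta_docs:
        raise ValueError("expected at least the container document")

    main, *embedded = rmeta_docs

    # Copy so the caller's /rmeta output is left untouched.
    parent = dict(main)
    parent["id"] = parent_id

    children = []
    for i, doc in enumerate(embedded, start=1):
        child = dict(doc)
        child["id"] = f"{parent_id}-{i}"  # hypothetical id scheme
        children.append(child)

    # Only add the key when there is something to nest.
    if children:
        parent["_childDocuments_"] = children

    return [parent]
```

Solr's block-join indexing needs a unique id on every parent and child, which is why the sketch has to invent ids; a real mapping would need to let users control that.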
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091774#comment-17091774 ]

Tim Allison commented on TIKA-3093:
---

A strawman proposal...

This relies on the /rmeta style output, e.g. [^test_recursive_embedded.docx.json].

Users could specify mappings in a forward-config.json file, supplied at server startup, like so:

{noformat}
{
  "url": "http://localhost:8983/solr",
  "method": "(put|post)",
  "onException": "(skip|continue)",
  "fields": {
    "include_non_mapped": false,
    "mappings": {
      "Content-Type": "mime",
      "X-TIKA:content": "content"
    }
  }
}
{noformat}

They'd put their bytes to http://localhost:9998/tika_forward. In the HTTP headers, they could include fields to inject, e.g. -H "field: id ; doc1" -H "field: myfield ; something_special".

If there's a parse exception and "onException" is "continue", then the stacktrace would be stored in the /rmeta output, and the document would be forwarded. If set to "skip", the handler would throw an exception back to the client.
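As a rough illustration of how the "fields" section of the strawman forward-config.json might be applied to a single /rmeta metadata object, here is a minimal Python sketch. The function names are hypothetical, "method" and "onException" are fixed to one of their allowed values, and letting injected header fields override mapped ones is an assumption; the comment above does not say which should win.

```python
import json

# Strawman config from the comment above, with one concrete choice
# for "method" and "onException".
CONFIG = json.loads("""
{
  "url": "http://localhost:8983/solr",
  "method": "post",
  "onException": "continue",
  "fields": {
    "include_non_mapped": false,
    "mappings": {
      "Content-Type": "mime",
      "X-TIKA:content": "content"
    }
  }
}
""")


def parse_field_header(value):
    """Split a 'name ; value' header value into a (name, value) pair."""
    name, _, val = value.partition(";")
    return name.strip(), val.strip()


def map_document(metadata, config, field_headers=()):
    """Rename mapped keys, drop or keep the rest, then inject header fields."""
    fields = config["fields"]
    mappings = fields["mappings"]
    keep_unmapped = fields.get("include_non_mapped", False)

    out = {}
    for key, value in metadata.items():
        if key in mappings:
            out[mappings[key]] = value
        elif keep_unmapped:
            out[key] = value

    # Assumption: injected fields override mapped ones on a name clash.
    for header in field_headers:
        name, val = parse_field_header(header)
        out[name] = val
    return out
```

With the two example headers from the comment, a document carrying Content-Type, X-TIKA:content and an unmapped Author key would be forwarded with only the mime, content, id and myfield fields.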
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091709#comment-17091709 ]

Tim Allison commented on TIKA-3093:
---

[~lewismc] will converting tika-server to OpenAPI (right term?) take away the need for this?
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091708#comment-17091708 ]

Chris Mattmann commented on TIKA-3093:
---

Yeah, we have lots of pipelines with OODT and Tika that do this already; http://github.com/apache/drat/ is a classic example of this...
[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091703#comment-17091703 ]

David Eric Pugh commented on TIKA-3093:
---

Out of curiosity, is this type of behavior, the "let me chain a set of interactions together", something that already exists? Imagine System A is a CMS, System B is Tika, and System C is Solr. What if I wanted to do something like "send a request for a doc by id to System A, have it dig up the doc, then forward it to System B for extraction, and then forward the result to System C for storage"? Is there an already existing pattern for this that Tika could conform to? It feels like a pipe of some kind...