[ 
https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965278#comment-16965278
 ] 

Nick Burch commented on TIKA-2972:
----------------------------------

I see the "send the results to a remote network service" thing as probably 
being separate from the Content Handler.

We'll need some way to configure the server for what URLs it will allow onward 
sending, and with what headers being set or passed along, to reduce the 
security attack surface.

Once that's in place, I'd lean towards a new "extract-and-forward" endpoint that 
would accept the file + dest URL + parameters, check it's allowed, parse with 
Tika as normal, then send the results on. I'd probably allow the security 
settings to be done either by Tika Config (for the manual setup case), 
or via some sort of properties (for the SOLR/ES forking pet copies of Tika case).
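
To make the "check it's allowed" step concrete, here is a rough sketch of what such a check might look like. Everything in it is an assumption for illustration: `ForwardPolicy`, `isAllowed` and `filterHeaders` are hypothetical names, not existing Tika classes or config options, and a real implementation would need more hardening (redirects, DNS rebinding, and so on).

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical allowlist for an extract-and-forward endpoint; not an existing Tika class. */
public class ForwardPolicy {
    private final List<URI> allowedBases;
    private final Set<String> allowedHeaders; // stored lower-cased

    public ForwardPolicy(List<String> allowedBases, Set<String> allowedHeaders) {
        this.allowedBases = allowedBases.stream()
                .map(s -> URI.create(s).normalize())
                .collect(Collectors.toList());
        this.allowedHeaders = allowedHeaders.stream()
                .map(h -> h.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    /** True if dest has the same scheme, host and port as an allowed base and sits at or under its path. */
    public boolean isAllowed(String dest) {
        URI target;
        try {
            target = URI.create(dest).normalize(); // collapses "../" segments
        } catch (IllegalArgumentException e) {
            return false;
        }
        if (target.getHost() == null || target.getUserInfo() != null) {
            return false;
        }
        String path = pathOf(target);
        String lower = path.toLowerCase(Locale.ROOT);
        // Conservative: refuse percent-encoded dots and slashes rather than trying to decode them.
        if (lower.contains("%2e") || lower.contains("%2f") || lower.contains("%5c")) {
            return false;
        }
        for (URI base : allowedBases) {
            if (base.getScheme() == null || base.getHost() == null) {
                continue;
            }
            String basePath = pathOf(base);
            String prefix = basePath.endsWith("/") ? basePath : basePath + "/";
            if (base.getScheme().equalsIgnoreCase(target.getScheme())
                    && base.getHost().equalsIgnoreCase(target.getHost())
                    && port(base) == port(target)
                    && (path.equals(basePath) || path.startsWith(prefix))) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the headers whose names are on the allowlist (case-insensitive). */
    public Map<String, String> filterHeaders(Map<String, String> headers) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (allowedHeaders.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    private static String pathOf(URI u) {
        String p = u.getRawPath();
        return (p == null || p.isEmpty()) ? "/" : p;
    }

    private static int port(URI u) {
        if (u.getPort() != -1) {
            return u.getPort();
        }
        if ("https".equalsIgnoreCase(u.getScheme())) {
            return 443;
        }
        return "http".equalsIgnoreCase(u.getScheme()) ? 80 : -1;
    }
}
```

The allowed bases and header names could come from either source mentioned above (Tika Config or properties); the check itself would not need to care which.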

> Allow users to specify a list/map of ContentHandlerFactories in 
> tika-config.xml
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-2972
>                 URL: https://issues.apache.org/jira/browse/TIKA-2972
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> I'd like to add a tika-eval handler that will calculate text stats at the end 
> of parsing a document so that the user can get a unified/simpler view of the 
> number of tokens, out-of-vocabulary tokens, etc. in the metadata rather than 
> having to run their own post-parse process on the content.
> The problem comes with integrating this into tika-app and tika-server -- 
> tika-app balloons to 134MB.  I don't want to nearly double the size of 
> tika-app just so that I can add some stuff that very few folks will use.
> I think we've discussed this option before, but it would be handy to allow 
> users to specify a ContentHandlerFactory or possibly a map of 
> ContentHandlerFactories in tika-config.xml so that users can get custom 
> handling in tika-app and tika-server.
> The idea of a map of ContentHandlerFactories would be to have a name for 
> each content handler factory, and a user could call different handlers on 
> tika-server like this:
> -{{curl... http://localhost:9998/tika/custom/myhandler1}}-
> -{{curl... http://localhost:9998/tika/custom/myhandler2}}-
> That's not right because we'd want to differentiate classic Tika parsing and 
> the RecursiveParserWrapper...
> {{curl... http://localhost:9998/tika/myhandler1}}
> {{curl... http://localhost:9998/tika/myhandler2}}
> {{curl... http://localhost:9998/rmeta/myhandler1}}
> {{curl... http://localhost:9998/rmeta/myhandler2}}
> or in tika-app:
> {{java -jar tika-app.jar --handlerFactory=myhandler1...}}
> WDYT?
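
The named-factory idea in the quoted description could be sketched roughly as below. This is only an illustration under stated assumptions: `HandlerRegistry` and its methods are hypothetical, a plain `Supplier<ContentHandler>` stands in for a ContentHandlerFactory, and nothing here reflects how tika-config.xml or tika-server actually wire handlers.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;

/** Hypothetical registry mapping handler names to factories; not an existing Tika class. */
public class HandlerRegistry {
    // Endpoints taken from the example URLs in the issue description.
    private static final Set<String> ENDPOINTS = Set.of("tika", "rmeta");

    private final Map<String, Supplier<ContentHandler>> factories = new ConcurrentHashMap<>();

    /** Registers a factory under a name such as "myhandler1". */
    public void register(String name, Supplier<ContentHandler> factory) {
        factories.put(name, factory);
    }

    /** Creates a fresh handler for the given name, or throws if the name is unknown. */
    public ContentHandler newHandler(String name) {
        Supplier<ContentHandler> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown handler factory: " + name);
        }
        return factory.get();
    }

    /** Resolves a request path such as "/rmeta/myhandler1" to a fresh handler. */
    public ContentHandler newHandlerForPath(String path) {
        String[] parts = path.split("/");
        // Expect exactly ["", endpoint, handlerName].
        if (parts.length != 3 || !parts[0].isEmpty() || !ENDPOINTS.contains(parts[1])) {
            throw new IllegalArgumentException("Unsupported path: " + path);
        }
        return newHandler(parts[2]);
    }
}
```

Handing out a new handler per request, rather than sharing one instance, keeps per-document state such as text statistics from leaking between parses.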



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
