[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024669#comment-18024669
 ] 

Eric Pugh commented on SOLR-7632:
---------------------------------

I had a great discussion with [~tallison] the other week, and he crystallized 
something about Tika Server versus Tika Pipes.

Tika Server is a perfectly fine solution in a distributed cloud environment.
Following our current architecture, you could imagine a pod of Tika servers
with a load balancer in front if you were ingesting at scale.

However, if you want to run extraction on a local server without the
cloud infrastructure, this is where Tika Pipes comes in. It eliminates the
two challenges of our current "local" implementation: the Java process that
does extraction is the same Java process that runs Solr, and all the jars
that extraction needs have to ship with Solr.

Instead, with Tika Pipes, the Solr process talks to Tika Pipes, which spawns a
completely NEW Java process that does the extraction. The child process and
Tika/Solr communicate via stdio, which means the classpath of Solr doesn't need
any of the jars or dependencies that the child Tika process needs for
extraction. They each have their own classpath. And if something goes
wrong, the child process crashes or gets reaped, but Tika/Solr continues on
its merry way.
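
To illustrate the isolation idea only, here is a minimal sketch in Python. It
is not Tika Pipes' API and not Solr code: the child script, the "crash"
trigger, and the function name extract_in_child are all hypothetical, and
spawning one child per document is a simplification chosen for this sketch.
The point is that the parent talks to the child over stdin/stdout and keeps
running when the child dies.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical stand-in for the forked extraction process: it reads one
# line from stdin, "extracts" it, and writes the result to stdout.
# The input "crash" simulates a parser dying on a bad document.
CHILD_SOURCE = r"""
import sys
line = sys.stdin.readline().strip()
if line == "crash":
    sys.exit(1)  # simulate a fatal failure inside the extractor
print("extracted:" + line.upper())
"""


def extract_in_child(document: str, timeout: float = 10.0) -> Optional[str]:
    """Run one extraction in a fresh child process; return None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE],  # child has its own interpreter
            input=document + "\n",                 # request goes over stdin
            capture_output=True,                   # response comes back on stdout
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires
        return None
    if result.returncode != 0:
        # The child crashed; the parent just reports the failure and moves on
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(extract_in_child("hello"))
    print(extract_in_child("crash"))
    print(extract_in_child("still alive"))
```

In the Java case the same separation would also give the child its own
classpath, which is the part that keeps the extraction jars out of Solr.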

 

To set up Tika Pipes, you do some config (like we do for anything else), and 
what [~tallison] and I spitballed is a Tika Pipes parameter pointing the child 
process to a download of the tika-standard-server-x.yx.jar file. So, to get all 
your Tika dependencies, you just go grab that massive 63 MB jar file and point 
to it. No more CVEs for the Solr project, since there is only a very small set 
of Tika libs we need for Tika/Solr. Want NLP capabilities? Just go grab that 
Tika jar and add it to the custom classpath for the child process.

Since this all runs on your local server, you don't need another complete 
process, and it may be more efficient depending on your workloads.

Did I capture this, [~tallison]?

 

So, in terms of our path, I think, [~janhoy], that you are on the correct path.
If we land the current PR, then we could, in a separate PR, migrate our "local" 
plugin to Tika Pipes, which will give us the best of all worlds!

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>            Assignee: Jan Høydahl
>            Priority: Major
>              Labels: gsoc2017, memex, pull-request-available
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> It's a pain to upgrade Tika's jars all the time when we release, and if Tika 
> fails it messes up the ExtractingRequestHandler (e.g., the document type 
> caused Tika to fail, etc). A more reliable, separated, and easier-to-deploy 
> version of the ExtractingRequestHandler would make a network call 
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the 
> results and then index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

