[ 
https://issues.apache.org/jira/browse/SOLR-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934996#comment-15934996
 ] 

Tim Allison commented on SOLR-9552:
-----------------------------------

Um, sure, we call it Tika 2.0. Do take a look at that branch and please do 
contribute to the design.  Let us know if that will meet your needs.

[~elyograg] already mentioned the key problem.  Even if there is a clean break 
in the design between native libs and non-native libs, and even though we're 
now running Tika against a TB of data (3 million files) from Common Crawl 
before we do a release, and even though we try to be as responsive as we 
possibly can to JIRA issues about Tika behaving badly, some parser at some 
point is going to do something awful (OOM blow out, permanent hang, execute 
malicious code, slow burning memory leak, etc...things that cannot be handled 
by catch blocks and asking a Thread to kindly stop), so it is better to keep 
Tika in a separate JVM via tika-server or encapsulate it in another way: 
tika-batch (spawns a child process) or ForkParser (child jar and child process).
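
As a rough sketch of the tika-server option: the snippet below assumes a 
tika-server instance is already running on its default port (9998) and uses 
the JDK 11+ java.net.http client to PUT a document to the /tika endpoint with 
a hard request timeout. The class and method names (RemoteTikaSketch, 
buildExtractRequest, extractText) are invented for illustration and are not 
part of Tika or Solr. The timeout value is an assumption to tune per 
deployment.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper: talks to a separately running tika-server so that a
// misbehaving parser can only hurt the server's JVM, not the caller's.
final class RemoteTikaSketch {

    // Builds a PUT request against <serverBase>/tika asking for plain text.
    // The timeout bounds how long the caller waits for a response.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] document, Duration timeout) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .timeout(timeout)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the request. Returns the extracted text on HTTP 200, otherwise
    // empty. A timeout surfaces as HttpTimeoutException, which is an
    // IOException, so a hung parser ends up in the catch block here.
    static Optional<String> extractText(HttpClient client, HttpRequest request)
            throws InterruptedException {
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (IOException e) {
            // Timed out, connection refused, or the server process died.
            return Optional.empty();
        }
    }
}
```

A caller would create one HttpClient via HttpClient.newHttpClient(), build a 
request per document, and treat an empty result as "skip or retry this 
document". Note that the timeout only frees the caller. Restarting a wedged 
or crashed server process still has to be handled separately.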

In short, [~daddywri], I regret that there is no bright line between 
well-behaved and ill-behaved parsers.  There is a bright line between native 
libs and non-Apache friendly parsers...users have to know about them and "opt 
in".  But anything else we can do to help ManifoldCF, let us know!

There are a few things that are holding up 2.0 (e.g., I've chosen to work on 
tactical issues instead of the more strategic 2.0 stuff), but take a look at 
the architecture and see if it's something that makes sense.

As for GSOC, yes, please, please, please let a bright GSOC'er figure out how to 
integrate Tika at a distance, whether that's through tika-server (SOLR-7632) or 
some other means.

And, yes, please let the same or another GSOC'er figure out how to allow 
handling of child documents and their metadata (SOLR-7229).

Oh, and if you have some extra GSOC cycles, there's always TIKA-1443. :)
I'm more than happy to chip in.

> Upgrade to Tika 1.14 when available
> -----------------------------------
>
>                 Key: SOLR-9552
>                 URL: https://issues.apache.org/jira/browse/SOLR-9552
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - DataImportHandler
>            Reporter: Tim Allison
>
>  Let's upgrade Solr as soon as 1.14 is available.
> P.S. I _think_ we're soon to wrap up work on 1.14.  Any last requests? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
