[
https://issues.apache.org/jira/browse/SOLR-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934996#comment-15934996
]
Tim Allison commented on SOLR-9552:
-----------------------------------
Um, sure, we call it Tika 2.0. Do take a look at that branch and please do
contribute to the design. Let us know if that will meet your needs.
[~elyograg] already mentioned the key problem. Even if there is a clean break
in the design between native libs and non-native libs, and even though we're
now running Tika against a TB of data (3 million files) from Common Crawl
before we do a release, and even though we try to be as responsive as we
possibly can to JIRA issues about Tika behaving badly, some parser at some
point is going to do something awful (OOM blowout, permanent hang, execution of
malicious code, slow-burning memory leak, etc.), things that cannot be handled
by catch blocks or by asking a Thread to kindly stop. So it is better to keep
Tika in a separate JVM via tika-server, or to encapsulate it in another way:
tika-batch (spawns a child process) or ForkParser (child jar and child process).
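To make the isolation idea concrete, here is a minimal sketch of the general pattern (run the risky work in a child process and kill it from the outside if it hangs). This is not how tika-batch or ForkParser are implemented; the class name IsolatedRun and the method runWithTimeout are hypothetical names made up for illustration, and only standard JDK process APIs (Java 9+) are used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only: not part of Tika or Solr.
final class IsolatedRun {

    /**
     * Runs the given command in a child process and waits up to timeoutMillis.
     * Returns the child's exit code, or -1 if the child had to be killed
     * because it did not finish in time.
     */
    static int runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on a full pipe
                .start();
        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // A hung or runaway child cannot be fixed with a catch block,
            // but the parent can always kill the whole process.
            child.destroyForcibly();
            child.waitFor();
            return -1;
        }
        return child.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for "a JVM that runs the parser": here it just prints the Java version.
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        int exit = runWithTimeout(List.of(javaBin, "-version"), 10_000);
        System.out.println("child exit code: " + exit);
    }
}
```

The point of the design is that an OOM, a permanent hang, or a leak stays inside the child, and the parent (Solr, in this discussion) only ever sees an exit code or a timeout.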
In short, [~daddywri], I regret that there is no bright line between
well-behaved and ill-behaved parsers. There is a bright line around native
libs and non-Apache-friendly parsers: users have to know about them and "opt
in". But if there is anything else we can do to help ManifoldCF, let us know!
There are a few things that are holding up 2.0 (e.g., I've chosen to work on
tactical issues instead of the more strategic 2.0 stuff), but take a look at
the architecture and see if it's something that makes sense.
As for GSOC, yes, please, please, please let a bright GSOC'er figure out how to
integrate Tika at a distance, whether that's through tika-server (SOLR-7632) or
some other means.
And, yes, please let the same or another GSOC'er figure out how to allow handling of
child documents and their metadata (SOLR-7229).
Oh, and if you have some extra GSOC cycles, there's always TIKA-1443. :)
I'm more than happy to chip in.
> Upgrade to Tika 1.14 when available
> -----------------------------------
>
> Key: SOLR-9552
> URL: https://issues.apache.org/jira/browse/SOLR-9552
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: contrib - DataImportHandler
> Reporter: Tim Allison
>
> Let's upgrade Solr as soon as 1.14 is available.
> P.S. I _think_ we're soon to wrap up work on 1.14. Any last requests?