[ 
https://issues.apache.org/jira/browse/SOLR-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934966#comment-15934966
 ] 

Tim Allison commented on SOLR-9552:
-----------------------------------

bq. So, is there an easy way to figure out what IS included?

No.  That's another aspect of the Solr integration that unsettles me.  Users 
think they're getting the full Tika, but they aren't.

With each update, the committer decides which dependencies and therefore which 
parsers to include.  From what I can tell, Solr does not include:
1) Anything with "optional" dependencies whose licenses are not compatible 
with the Apache License 2.0 (makes complete sense)
2) Anything with native libs
3) Image parsers 
([webp|https://issues.apache.org/jira/browse/SOLR-8981?focusedCommentId=15336790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15336790]),
 except when it does, e.g. drewnoakes' metadata-extractor or Tesseract if the 
user has it installed.

I'm frankly not sure what happens with Grobid or some of the "advanced" 
parsers. See below.

bq.  I am guessing Tika has a bunch of extension libraries that Solr does not 
include by default.

To be fair to Solr, there aren't that many parsers that are missing.

 * SQLite -- native libs; Tika users have to remember to add xerial's jar to 
their classpath anyhow, and so do Solr users.
 * Image types that require non-ASL-friendly licenses (see 
[PDFBox|https://pdfbox.apache.org/2.0/dependencies.html#optional-components]); 
again, Tika users have to put those on their classpath, too
 * Webp, mentioned above
 * Grobid (?)
 * CTakes (?)
 * envi, gdal, geo.topic, geoinfo (??)
 * PooledTimeSeriesParser (??)
 * Tensorflow object recognizer (??)

[~chrismattmann], any idea what happens to these in Solr?
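
As a partial answer to "what IS included", here is a rough sketch that lists the parser class names registered on a given classpath. It relies on Tika discovering parsers through {{META-INF/services/org.apache.tika.parser.Parser}} service files. The class name {{ListTikaParsers}} and its helper methods are hypothetical, and which classloader is the right one to inspect inside a Solr deployment is an assumption left to the reader. A class being listed only means it is registered. It does not mean its optional or native dependencies are present, so treat the output as an upper bound.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Tika or Solr.
public class ListTikaParsers {

    // Service file read by Tika's ServiceLoader-style parser discovery.
    static final String SERVICE_FILE =
            "META-INF/services/org.apache.tika.parser.Parser";

    /** Reads one service file: drops '#' comments and blank lines, keeps class names. */
    public static List<String> parseServiceLines(BufferedReader reader) throws IOException {
        List<String> names = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Collects parser class names from every matching service file visible to the loader. */
    public static List<String> registeredParsers(ClassLoader loader) throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> urls = loader.getResources(SERVICE_FILE);
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                names.addAll(parseServiceLines(reader));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the jars of interest are on this class's own classpath.
        for (String name : registeredParsers(ListTikaParsers.class.getClassLoader())) {
            System.out.println(name);
        }
    }
}
```

Run it with the same jars on the classpath that Solr ships for extraction. With no Tika jars present it simply prints nothing.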

> Upgrade to Tika 1.14 when available
> -----------------------------------
>
>                 Key: SOLR-9552
>                 URL: https://issues.apache.org/jira/browse/SOLR-9552
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - DataImportHandler
>            Reporter: Tim Allison
>
>  Let's upgrade Solr as soon as 1.14 is available.
> P.S. I _think_ we're soon to wrap up work on 1.14.  Any last requests? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

