[jira] [Updated] (SOLR-6488) Upgrade to TIKA 1.6

Uwe Schindler (JIRA) Sat, 06 Sep 2014 02:56:30 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated SOLR-6488:
--------------------------------
    Attachment: SOLR-6488.patch

Initial patch with updated lib versions.
Some dependencies for the crazy parsers are still missing; I will review them.

The current test suite fails because some of the parsers seem to add a new 
metadata field:
{noformat}
   [junit4] Started J0 PID(7180@VEGA).
   [junit4] Suite: org.apache.solr.handler.extraction.ExtractingRequestHandlerTest
   [junit4]   2> Creating dataDir: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr1\solr\build\contrib\solr-cell\test\J0\.\temp\solr.handler.extraction.ExtractingRequestHandlerTest-3D229694F89D0471-001\init-core-data-001
   [junit4]   2> log4j:WARN No appenders could be found for logger (org.apache.solr.SolrTestCaseJ4).
   [junit4]   2> log4j:WARN Please initialize the log4j system properly.
   [junit4]   2> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testLiterals -Dtests.seed=3D229694F89D0471 -Dtests.locale=sq -Dtests.timezone=SystemV/AST4ADT -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.14s | ExtractingRequestHandlerTest.testLiterals <<<
   [junit4]    > Throwable #1: org.apache.solr.common.SolrException: ERROR: [doc=three] unknown field 'X-Parsed-By'
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([3D229694F89D0471:D30A12C89093A342]:0)
   [junit4]    >        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
   [junit4]    >        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:79)
   [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
   [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
   [junit4]    >        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
   [junit4]    >        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
   [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:895)
   [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:692)
   [junit4]    >        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
   [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
   [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
   [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
   [junit4]    >        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   [junit4]    >        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   [junit4]    >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1985)
   [junit4]    >        at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:317)
   [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:619)
   [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testLiterals(ExtractingRequestHandlerTest.java:275)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testPlainTextSpecifyingResourceName -Dtests.seed=3D229694F89D0471 -Dtests.locale=sq -Dtests.timezone=SystemV/AST4ADT -Dtests.file.encoding=US-ASCII
{noformat}

I have not yet verified what this field contains. Maybe TIKA adds it with a 
static value; in that case we should ignore it, because we don't need to add 
the same field with the same content to the index for every document.
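
As a rough illustration of "ignore it", here is a minimal sketch that drops a 
fixed set of metadata names before a document is built. It is plain Java and 
not Solr or TIKA code: the class MetadataFilter, the method dropIgnored and 
the ignore list are hypothetical names, and the value used in main is only a 
placeholder, since the real content of the field has not been verified.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class MetadataFilter {
    // Hypothetical ignore list; "X-Parsed-By" is the field name from the test failure above.
    static final Set<String> IGNORED = new HashSet<>(Arrays.asList("X-Parsed-By"));

    /** Returns a copy of the extracted metadata without the ignored field names. */
    static Map<String, String> dropIgnored(Map<String, String> metadata) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!IGNORED.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Example");
        // Placeholder value (assumption): the actual content of this field is unverified.
        metadata.put("X-Parsed-By", "placeholder");
        System.out.println(dropIgnored(metadata));
    }
}
```

Whether filtering like this or a change to the test schema is the better fix 
depends on what the field turns out to contain.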

In addition, dom4j was removed from TIKA, but there is still something in 
solr-core that needs dom4j.jar. This is a really outdated and no longer usable 
lib. Can we nuke it? Solr itself is not using it, so I think maybe hadoop? 
If [[email protected]] has an idea what depends on it, I would be grateful. 
Also the dependency validator complains about a circular dep:
{noformat}
[libversions]   circular dependency found: 
dom4j#dom4j;1.6.1->jaxen#jaxen;1.1-beta-6->dom4j#dom4j;1.5.2
{noformat}
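
Note that the reported chain starts at dom4j 1.6.1 and ends at dom4j 1.5.2, so 
it is only a cycle when versions are ignored. A minimal sketch of such a 
check, assuming modules are compared by organisation#name with the version 
stripped (I have not checked how the libversions validator really does it; 
CycleCheck and its methods are hypothetical names):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CycleCheck {
    /** Strips the version: "dom4j#dom4j;1.6.1" becomes "dom4j#dom4j". */
    static String module(String coordinate) {
        int semi = coordinate.indexOf(';');
        return semi < 0 ? coordinate : coordinate.substring(0, semi);
    }

    /** Returns true if the module of 'start' is reachable from itself, ignoring versions. */
    static boolean hasCycle(Map<String, List<String>> deps, String start) {
        String target = module(start);
        return visit(deps, target, target, new HashSet<String>());
    }

    // Depth-first search over the version-less dependency graph.
    private static boolean visit(Map<String, List<String>> deps, String current,
                                 String target, Set<String> seen) {
        for (String next : deps.getOrDefault(current, Collections.<String>emptyList())) {
            String m = module(next);
            if (m.equals(target)) {
                return true;
            }
            if (seen.add(m) && visit(deps, m, target, seen)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Keys are version-less modules; values are the coordinates from the validator output.
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("dom4j#dom4j", Arrays.asList("jaxen#jaxen;1.1-beta-6"));
        deps.put("jaxen#jaxen", Arrays.asList("dom4j#dom4j;1.5.2"));
        System.out.println(hasCycle(deps, "dom4j#dom4j;1.6.1"));
    }
}
```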

In addition, commons-compress was updated to 1.8.1; it is used in other places, 
too. I hope this does not conflict with any Solr-internal code.

> Upgrade to TIKA 1.6
> -------------------
>
>                 Key: SOLR-6488
>                 URL: https://issues.apache.org/jira/browse/SOLR-6488
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Uwe Schindler
>             Fix For: 5.0, 4.11
>
>         Attachments: SOLR-6488.patch
>
>
> Apache TIKA 1.6 came out yesterday, we should upgrade it.
> The dependencies of bundled Apache POI changed (xmlbeans upgraded, already 
> done. dom4j is obsolete). We have to carefully verify the dependency tree!!!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
