[
https://issues.apache.org/jira/browse/SOLR-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606588#comment-13606588
]
Hoss Man commented on SOLR-4530:
--------------------------------
Hmmm...
I applied this patch URL to trunk and got a failure in the modified tests...
https://github.com/arafalov/lucene-solr/commit/bef2f84fd6943241c0f720f17011e5e42d919914.patch
{noformat}
[junit4:junit4] 2> 2321 T10 oas.SolrTestCaseJ4.tearDown ###Ending
testTikaHTMLMapperIdentity
[junit4:junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestTikaEntityProcessor -Dtests.method=testTikaHTMLMapperIdentity
-Dtests.seed=699D812F169C4A5E -Dtests.slow=true -Dtests.locale=el
-Dtests.timezone=America/Noronha -Dtests.file.encoding=UTF-8
[junit4:junit4] ERROR 0.11s J0 |
TestTikaEntityProcessor.testTikaHTMLMapperIdentity <<<
[junit4:junit4] > Throwable #1: java.lang.RuntimeException: Exception during
query
[junit4:junit4] > at
__randomizedtesting.SeedInfo.seed([699D812F169C4A5E:39E205BEDFA8BFA3]:0)
[junit4:junit4] > at
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:524)
[junit4:junit4] > at
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:491)
[junit4:junit4] > at
org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperIdentity(TestTikaEntityProcessor.java:101)
...
[junit4:junit4] > Caused by: java.lang.RuntimeException: REQUEST FAILED:
xpath=//str[@name='text'][contains(.,'<H1>')]
[junit4:junit4] > xml response was: <?xml version="1.0" encoding="UTF-8"?>
[junit4:junit4] > <response>
[junit4:junit4] > <lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int><lst name="params"><str name="start">0</str><str
name="q">*:*</str><str name="qt">standard</str><str name="rows">20</str><str
name="version">2.2</str></lst></lst><result name="response" numFound="1"
start="0"><doc><str name="text"><?xml version="1.0"
encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
[junit4:junit4] > <head>
[junit4:junit4] > <meta name="Content-Encoding" content="ISO-8859-1"/>
[junit4:junit4] > <meta name="Content-Type" content="text/html;
charset=ISO-8859-1"/>
[junit4:junit4] > <meta name="dc:title" content="Title in the header"/>
[junit4:junit4] > <title>Title in the header</title>
[junit4:junit4] > </head>
[junit4:junit4] > <body>
[junit4:junit4] > <h1>H1 Header</h1>
[junit4:junit4] >
[junit4:junit4] > <div>Basic div</div>
[junit4:junit4] >
[junit4:junit4] > <div class="classAttribute">Div with
attribute</div>
[junit4:junit4] >
[junit4:junit4] > </body></html></str></doc></result>
[junit4:junit4] > </response>
[junit4:junit4] >
[junit4:junit4] > request
was:start=0&q=*:*&qt=standard&rows=20&version=2.2
[junit4:junit4] > at
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:517)
[junit4:junit4] > ... 42 more
{noformat}
...suggesting maybe the comment about uppercasing/lowercasing tags in Tika
isn't consistent across platforms? (Or maybe you previously tested against a
slightly different version of Tika and the behavior has changed?)
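If the tag case really does vary, one option would be to make the test's XPath
case-insensitive. A minimal sketch of the idea, using only the JDK's
javax.xml.xpath and XPath 1.0's translate() function; the class name, the
trimmed-down sample response, and the constants are hypothetical and are not
taken from the patch or from TestTikaEntityProcessor:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class CaseInsensitiveXPathCheck {

    // Hypothetical, heavily trimmed stand-in for the response in the log above:
    // the stored field holds escaped markup with a lowercase <h1> tag.
    public static final String SAMPLE =
        "<doc><str name=\"text\">&lt;h1&gt;H1 Header&lt;/h1&gt;</str></doc>";

    // The check from the failing test: expects an uppercase <H1>.
    public static final String EXACT =
        "//str[@name='text'][contains(.,'<H1>')]";

    // Same check, but the field text is folded to lowercase first, so it
    // matches whichever case the tags come back in.
    public static final String FOLDED =
        "//str[@name='text'][contains(translate(.,"
        + "'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'<h1>')]";

    // Evaluates the XPath against the XML string; true if any node matches.
    public static boolean matches(String xml, String xpath)
            throws XPathExpressionException {
        XPath xp = XPathFactory.newInstance().newXPath();
        return (Boolean) xp.evaluate(
            xpath, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println("exact:  " + matches(SAMPLE, EXACT));
        System.out.println("folded: " + matches(SAMPLE, FOLDED));
    }
}
```

That would only paper over the difference in the test, of course; it doesn't
explain why the case differs in the first place.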
> DIH: Provide configuration to use Tika's IdentityHtmlMapper
> -----------------------------------------------------------
>
> Key: SOLR-4530
> URL: https://issues.apache.org/jira/browse/SOLR-4530
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Affects Versions: 4.1
> Reporter: Alexandre Rafalovitch
> Priority: Minor
> Fix For: 4.3
>
>
> When using TikaEntityProcessor in DIH, the default HTML Mapper strips out
> most of the HTML. It may make sense when the expectation is just to store the
> extracted content as a text blob, but DIH allows more fine-tuned content
> extraction (e.g. with nested XPathEntityProcessor).
> Recent Tika versions allow setting an alternative HTML Mapper implementation
> that passes all the HTML in. It would be useful to be able to set that
> implementation from DIH configuration.