[ 
https://issues.apache.org/jira/browse/SOLR-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606588#comment-13606588
 ] 

Hoss Man commented on SOLR-4530:
--------------------------------

Hmmm...

I applied the patch from this URL to trunk and got a failure in the modified tests...
https://github.com/arafalov/lucene-solr/commit/bef2f84fd6943241c0f720f17011e5e42d919914.patch

{noformat}
[junit4:junit4]   2> 2321 T10 oas.SolrTestCaseJ4.tearDown ###Ending testTikaHTMLMapperIdentity
[junit4:junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTikaEntityProcessor -Dtests.method=testTikaHTMLMapperIdentity -Dtests.seed=699D812F169C4A5E -Dtests.slow=true -Dtests.locale=el -Dtests.timezone=America/Noronha -Dtests.file.encoding=UTF-8
[junit4:junit4] ERROR   0.11s J0 | TestTikaEntityProcessor.testTikaHTMLMapperIdentity <<<
[junit4:junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
[junit4:junit4]    >    at __randomizedtesting.SeedInfo.seed([699D812F169C4A5E:39E205BEDFA8BFA3]:0)
[junit4:junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:524)
[junit4:junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:491)
[junit4:junit4]    >    at org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperIdentity(TestTikaEntityProcessor.java:101)
...
[junit4:junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//str[@name='text'][contains(.,'<H1>')]
[junit4:junit4]    >    xml response was: <?xml version="1.0" encoding="UTF-8"?>
[junit4:junit4]    > <response>
[junit4:junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="start">0</str><str name="q">*:*</str><str name="qt">standard</str><str name="rows">20</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="text">&lt;?xml version="1.0" encoding="UTF-8"?&gt;&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
[junit4:junit4]    > &lt;head&gt;
[junit4:junit4]    > &lt;meta name="Content-Encoding" content="ISO-8859-1"/&gt;
[junit4:junit4]    > &lt;meta name="Content-Type" content="text/html; charset=ISO-8859-1"/&gt;
[junit4:junit4]    > &lt;meta name="dc:title" content="Title in the header"/&gt;
[junit4:junit4]    > &lt;title&gt;Title in the header&lt;/title&gt;
[junit4:junit4]    > &lt;/head&gt;
[junit4:junit4]    > &lt;body&gt;
[junit4:junit4]    > &lt;h1&gt;H1 Header&lt;/h1&gt;
[junit4:junit4]    > 
[junit4:junit4]    > &lt;div&gt;Basic div&lt;/div&gt;
[junit4:junit4]    > 
[junit4:junit4]    > &lt;div class="classAttribute"&gt;Div with attribute&lt;/div&gt;
[junit4:junit4]    > 
[junit4:junit4]    > &lt;/body&gt;&lt;/html&gt;</str></doc></result>
[junit4:junit4]    > </response>
[junit4:junit4]    > 
[junit4:junit4]    >    request was:start=0&q=*:*&qt=standard&rows=20&version=2.2
[junit4:junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:517)
[junit4:junit4]    >    ... 42 more
{noformat}
{noformat}

...suggesting maybe the comment about uppercasing/lowercasing tags in Tika 
isn't consistent across platforms?  (Or maybe you previously tested against a 
slightly different version of Tika and the behavior has changed?)
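
One possible way to make the assertion independent of tag case, sketched here with only the JDK's XPath 1.0 support: the test's xpath looks for '<H1>' while the response above contains '<h1>', so the sketch lowercases the 'H' via translate() before comparing. The class name, the matches() helper and the trimmed-down XML are hypothetical and not part of the patch; whether this is the right fix depends on what Tika is actually expected to emit.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch class, not part of the SOLR-4530 patch.
class CaseInsensitiveXPathSketch {

    // Hypothetical helper: true if the XPath selects at least one node in the XML.
    static boolean matches(String xml, String xpathExpr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        return (Boolean) xpath.evaluate(xpathExpr, source, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down stand-in for the response in the log (assumption).
        String xml = "<doc><str name=\"text\">&lt;h1&gt;H1 Header&lt;/h1&gt;</str></doc>";

        // The xpath from the failing test: case-sensitive match on '<H1>'.
        String strict = "//str[@name='text'][contains(.,'<H1>')]";

        // Same check after mapping 'H' to 'h', so either tag case is accepted.
        String relaxed = "//str[@name='text'][contains(translate(.,'H','h'),'<h1>')]";

        System.out.println("strict:  " + matches(xml, strict));
        System.out.println("relaxed: " + matches(xml, relaxed));
    }
}
```

The relaxed expression accepts both '<H1>' and '<h1>', which would hide the platform or version difference rather than explain it, so it is only useful if the tag case is not something the test is meant to pin down.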
                
> DIH: Provide configuration to use Tika's IdentityHtmlMapper
> -----------------------------------------------------------
>
>                 Key: SOLR-4530
>                 URL: https://issues.apache.org/jira/browse/SOLR-4530
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 4.1
>            Reporter: Alexandre Rafalovitch
>            Priority: Minor
>             Fix For: 4.3
>
>
> When using TikaEntityProcessor in DIH, the default HTML Mapper strips out 
> most of the HTML. It may make sense when the expectation is just to store the 
> extracted content as a text blob, but DIH allows more fine-tuned content 
> extraction (e.g. with nested XPathEntityProcessor).
> Recent Tika versions allow setting an alternative HTML Mapper implementation 
> that passes all the HTML through. It would be useful to be able to set that 
> implementation from DIH configuration.

