[ 
https://issues.apache.org/jira/browse/SOLR-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606594#comment-13606594
 ] 

Alexandre Rafalovitch commented on SOLR-4530:
---------------------------------------------

Could be a different version of Tika, as I originally tested it against Solr 4.1. 
I will retest. Should I retest against trunk or against 4.2 (4.2.1? 4.3?) 
if I want this to make it into a 4.x sub-release?
                
> DIH: Provide configuration to use Tika's IdentityHtmlMapper
> -----------------------------------------------------------
>
>                 Key: SOLR-4530
>                 URL: https://issues.apache.org/jira/browse/SOLR-4530
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 4.1
>            Reporter: Alexandre Rafalovitch
>            Priority: Minor
>             Fix For: 4.3
>
>
> When using TikaEntityProcessor in DIH, the default HTML Mapper strips out 
> most of the HTML. That may make sense when the expectation is just to store 
> the extracted content as a text blob, but DIH allows more fine-tuned content 
> extraction (e.g. with a nested XPathEntityProcessor).
> Recent Tika versions allow setting an alternative HTML Mapper implementation 
> that passes all the HTML through. It would be useful to be able to set that 
> implementation from the DIH configuration.
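
For illustration, the request above could look roughly like the following DIH 
configuration. This is only a sketch: the htmlMapper attribute and its 
"identity" value are assumptions about how such an option might be exposed, 
not something this message confirms, and the file path, entity name, and field 
names are placeholders.

```xml
<dataConfig>
  <!-- Binary data source that hands the raw file stream to Tika -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- htmlMapper="identity" is the assumed switch that would select
         Tika's IdentityHtmlMapper instead of the default mapper,
         so the HTML markup is passed through rather than stripped -->
    <entity name="page" processor="TikaEntityProcessor"
            dataSource="bin" url="/path/to/page.html"
            format="xml" htmlMapper="identity">
      <!-- "text" is the column TikaEntityProcessor uses for the body -->
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```

With the markup preserved, the output of this entity could then be fed to a 
nested XPathEntityProcessor for the finer-grained extraction the description 
mentions.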
