[ https://issues.apache.org/jira/browse/SOLR-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608565#comment-13608565 ]
Alexandre Rafalovitch commented on SOLR-4530:
---------------------------------------------
The case issue was apparently a bug, since fixed in TIKA-869.
I have fixed that and applied the changes to trunk. The patch is attached and
the tests seem to pass.
> DIH: Provide configuration to use Tika's IdentityHtmlMapper
> -----------------------------------------------------------
>
> Key: SOLR-4530
> URL: https://issues.apache.org/jira/browse/SOLR-4530
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Affects Versions: 4.1
> Reporter: Alexandre Rafalovitch
> Priority: Minor
> Fix For: 4.3
>
> Attachments: SOLR-4530.patch
>
>
> When using TikaEntityProcessor in DIH, the default HTML Mapper strips out
> most of the HTML. That may make sense when the expectation is just to store
> the extracted content as a text blob, but DIH allows more fine-tuned content
> extraction (e.g. with a nested XPathEntityProcessor).
> Recent Tika versions allow setting an alternative HTML Mapper implementation
> that passes all the HTML through. It would be useful to be able to set that
> implementation from the DIH configuration.
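
For illustration only, the requested setting might look like the fragment below in a DIH data-config file. This is a sketch, not the contents of the attached patch: the attribute name htmlMapper and its value "identity" are assumptions taken from later Solr documentation for TikaEntityProcessor, and the entity name, file path, and field names are hypothetical.

```xml
<dataConfig>
  <!-- Binary data source, so Tika receives the raw file stream -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- htmlMapper="identity" (assumed attribute name) asks Tika to keep
         the HTML elements rather than stripping most of them;
         format="xml" keeps the markup in the emitted text -->
    <entity name="page" processor="TikaEntityProcessor"
            dataSource="bin" url="/path/to/page.html"
            format="xml" htmlMapper="identity">
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```

With the markup preserved, the "text" column could then be handed to a nested entity for finer-grained extraction, as the description suggests.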