[ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346702#comment-14346702 ]
Daniel Bonniot de Ruisselet commented on TIKA-1017:
---------------------------------------------------
If we want to preserve the semantics, maybe at least SUB and SUP should be
added? For instance, in a scientific document "a<sup>b</sup>" and
"a<sub>b</sub>" can denote different concepts, and that distinction is lost if
both come out as "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and
STRONG.

It's easy enough to use another mapper, so this is not a huge issue, but a good
default is always nice. Given the above, maybe only add SUB and SUP?
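
To make the "use another mapper" workaround concrete, here is a minimal sketch of a mapper that passes SUB and SUP through and defers everything else to an existing mapper. It is self-contained on purpose: SafeElementMapper and BaseMapper are hypothetical stand-ins, not Tika classes, and BaseMapper's safe list is deliberately tiny. The sketch assumes the mapper receives upper-case element names and returns the lower-case name to emit, or null to drop the tag, as Tika's HtmlMapper.mapSafeElement does.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the single mapper method discussed here.
// This is NOT Tika's real interface, only the same shape for one method.
interface SafeElementMapper {
    // Returns the normalized name to emit for a safe element, or null to drop the tag.
    String mapSafeElement(String name);
}

// Hypothetical default mapper with a tiny, illustrative safe list.
class BaseMapper implements SafeElementMapper {
    private static final Map<String, String> SAFE = new HashMap<>();
    static {
        SAFE.put("P", "p");
        SAFE.put("H1", "h1");
    }

    @Override
    public String mapSafeElement(String name) {
        return SAFE.get(name);
    }
}

// Adds SUB and SUP on top of whatever the wrapped mapper already accepts.
class SubSupMapper implements SafeElementMapper {
    private final SafeElementMapper delegate;

    SubSupMapper(SafeElementMapper delegate) {
        this.delegate = delegate;
    }

    @Override
    public String mapSafeElement(String name) {
        if (name == null) {
            return null;
        }
        String upper = name.toUpperCase(Locale.ENGLISH);
        if (upper.equals("SUB") || upper.equals("SUP")) {
            // Keep the tag so "a<sup>b</sup>" and "a<sub>b</sub>" stay distinguishable.
            return upper.toLowerCase(Locale.ENGLISH);
        }
        // Everything else keeps the wrapped mapper's behaviour.
        return delegate.mapSafeElement(name);
    }
}
```

With real Tika, the equivalent would presumably be a small HtmlMapper implementation wrapping DefaultHtmlMapper.INSTANCE and registered on the ParseContext under HtmlMapper.class. The other HtmlMapper methods would need to be delegated as well, which the sketch above leaves out.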
> DefaultHtmlMapper misses some safe elements
> -------------------------------------------
>
> Key: TIKA-1017
> URL: https://issues.apache.org/jira/browse/TIKA-1017
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
>
> The code of DefaultHtmlMapper says that the list of "safe" elements is based
> on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like <sub> and <i> are not included in the safe list. Is this
> intentional (a comment with the rationale would be useful) or should they be
> added?