[
https://issues.apache.org/jira/browse/UIMA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535821#comment-13535821
]
Peter Klügl commented on UIMA-2524:
-----------------------------------
It is probably better to separate the two functionalities: annotating html
files with annotations for the html tags, and converting html files to plain
text while retaining the annotations for the html tags. A new analysis engine
for the second functionality can then also be used on a CAS that already
contains other annotations, which makes it a useful analysis engine for
converting any CAS with html artifacts. I will refactor the HTMLAnnotator and
remove the code for stripping the html tags, and I will create a new issue for
the additional analysis engine.
> TextMarker html conversion to plain text is not working correctly
> -----------------------------------------------------------------
>
> Key: UIMA-2524
> URL: https://issues.apache.org/jira/browse/UIMA-2524
> Project: UIMA
> Issue Type: Bug
> Components: TextMarker
> Affects Versions: 2.0.0TextMarker
> Reporter: Peter Klügl
> Assignee: Peter Klügl
>
> The HTMLAnnotator shipped with TextMarker is able to strip the html tags and to
> create an additional view with the plain text. During this step the tag
> information is converted to annotations, whose offsets are adapted according
> to the removed tags. This functionality is not working correctly: the tags of
> the body of the html document are not removed.
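
For illustration, the offset adaptation described in the issue can be sketched
in plain Java. This is a minimal sketch under stated assumptions, not the
HTMLAnnotator code: the class HtmlStripSketch and its TagSpan and Result types
are hypothetical names, no UIMA API is used, and the scanner is deliberately
naive (no entities, no '>' inside attribute values or comments). Each span is
recorded with offsets into the stripped text rather than into the html.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: strips tags and keeps tag spans with offsets
// that refer to the resulting plain text.
public class HtmlStripSketch {

    /** A tag span; begin/end are offsets into the plain text. */
    public static final class TagSpan {
        public final String name;
        public final int begin;
        public int end;

        TagSpan(String name, int begin) {
            this.name = name;
            this.begin = begin;
            this.end = begin; // stays zero-length if the tag is never closed
        }
    }

    public static final class Result {
        public final String text;
        public final List<TagSpan> spans;

        Result(String text, List<TagSpan> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    public static Result strip(String html) {
        StringBuilder plain = new StringBuilder();
        List<TagSpan> spans = new ArrayList<>();
        Deque<TagSpan> open = new ArrayDeque<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or a stray '<'): copy it to the plain text.
                plain.append(c);
                i++;
                continue;
            }
            String tag = html.substring(i + 1, close).trim();
            if (tag.startsWith("/")) {
                String name = tag.substring(1).trim().toLowerCase(Locale.ROOT);
                boolean known = open.stream().anyMatch(s -> s.name.equals(name));
                // Pop until the matching open tag; unclosed tags in between stay zero-length.
                while (known && !open.isEmpty()) {
                    TagSpan span = open.pop();
                    if (span.name.equals(name)) {
                        span.end = plain.length();
                        break;
                    }
                }
            } else if (!tag.isEmpty() && !tag.endsWith("/") && Character.isLetter(tag.charAt(0))) {
                // Opening tag: its begin offset is the current plain-text length.
                String name = tag.split("\\s+")[0].toLowerCase(Locale.ROOT);
                TagSpan span = new TagSpan(name, plain.length());
                spans.add(span);
                open.push(span);
            }
            // Self-closing tags, comments, and doctypes are dropped without a span.
            i = close + 1;
        }
        return new Result(plain.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("<html><body><p>Hello <b>world</b></p></body></html>");
        System.out.println(r.text);
        for (TagSpan s : r.spans) {
            System.out.println(s.name + " [" + s.begin + "," + s.end + ")");
        }
    }
}
```

Because every begin and end is taken from the length of the plain text built so
far, the spans stay valid after the tags are removed, including tags inside the
body of the document.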