[ 
https://issues.apache.org/jira/browse/UIMA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535821#comment-13535821
 ] 

Peter Klügl commented on UIMA-2524:
-----------------------------------

It is probably better to separate these two functionalities: annotating html
files with annotations for the html tags, and converting html files to plain
text while retaining the annotations for the html tags. A new analysis engine
for the second functionality could then also be used on a CAS that already
contains other annotations, which would make it a useful analysis engine for
converting CASes with html artifacts. I will refactor the HTMLAnnotator and
remove the code for stripping the html tags, and I will create a new issue for
the additional analysis engine.
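
To make the second functionality concrete, here is a rough sketch of the offset
bookkeeping involved: strip the tags and report each tag as a span in the
coordinates of the resulting plain text. This is not the HTMLAnnotator code and
does not use the UIMA API. The names TagStripSketch, TagSpan, Result and strip
are hypothetical, and the sketch assumes well-nested tags with no comments,
entities, scripts, or unclosed elements such as <br>.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not the TextMarker HTMLAnnotator.
final class TagStripSketch {

    /** A tag expressed as a span over the stripped plain text. */
    static final class TagSpan {
        final String name;
        final int begin;
        final int end;

        TagSpan(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Plain text plus the tag spans whose offsets refer to that text. */
    static final class Result {
        final String text;
        final List<TagSpan> spans;

        Result(String text, List<TagSpan> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Result strip(String html) {
        StringBuilder plain = new StringBuilder();
        List<TagSpan> spans = new ArrayList<>();
        Deque<String> openNames = new ArrayDeque<>();
        Deque<Integer> openBegins = new ArrayDeque<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character: copy it to the plain text.
                plain.append(c);
                i++;
                continue;
            }
            String tag = html.substring(i + 1, close).trim();
            i = close + 1; // the tag itself contributes nothing to the plain text
            if (tag.startsWith("/")) {
                String name = tag.substring(1).trim().toLowerCase();
                // Only close the innermost open tag; mismatches are ignored.
                if (!openNames.isEmpty() && openNames.peek().equals(name)) {
                    openNames.pop();
                    spans.add(new TagSpan(name, openBegins.pop(), plain.length()));
                }
            } else if (!tag.isEmpty() && !tag.endsWith("/")) {
                // Opening tag: remember where its content starts in the plain text.
                String name = tag.split("\\s+")[0].toLowerCase();
                openNames.push(name);
                openBegins.push(plain.length());
            }
        }
        return new Result(plain.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("<html><body><p>Hi <b>there</b></p></body></html>");
        System.out.println(r.text);
        for (TagSpan s : r.spans) {
            System.out.println(s.name + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

Because begin and end are taken from the length of the plain text built so far,
the spans never refer to positions in the original html, so they remain valid
in a separate plain-text view. A real analysis engine would additionally have
to remap any annotations that already exist in the CAS.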
                
> TextMarker html conversion to plain text is not working correctly
> -----------------------------------------------------------------
>
>                 Key: UIMA-2524
>                 URL: https://issues.apache.org/jira/browse/UIMA-2524
>             Project: UIMA
>          Issue Type: Bug
>          Components: TextMarker
>    Affects Versions: 2.0.0TextMarker
>            Reporter: Peter Klügl
>            Assignee: Peter Klügl
>
> The HTMLAnnotator shipped with TextMarker is able to strip the html tags and to
> create an additional view with the plain text. During this step the tag
> information is converted to annotations, whose offsets are adapted according
> to the removed tags. This functionality is not working correctly: the tags of
> the body of the html document are not removed.

