[ 
https://issues.apache.org/jira/browse/UIMA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535851#comment-13535851
 ] 

Peter Klügl commented on UIMA-2524:
-----------------------------------

I have already done that. Tika is nice, but it wasn't really sufficient for the 
use cases I am targeting with that annotator. The HTMLAnnotator replaces some 
functionality I removed after I contributed TextMarker to UIMA (the language 
support for HTML), and the HTML stripper replaces the HTML visualization 
provided by the CEV plugin, which was replaced by the CAS Editor. After all, 
it's just about handling old use cases/applications.
                
> TextMarker html conversion to plain text is not working correctly
> -----------------------------------------------------------------
>
>                 Key: UIMA-2524
>                 URL: https://issues.apache.org/jira/browse/UIMA-2524
>             Project: UIMA
>          Issue Type: Bug
>          Components: TextMarker
>    Affects Versions: 2.0.0TextMarker
>            Reporter: Peter Klügl
>            Assignee: Peter Klügl
>
> The HTMLAnnotator shipped with TextMarker is able to strip the HTML tags and 
> to create an additional view with the plain text. During this step the tag 
> information is converted to annotations, whose offsets are adapted according 
> to the removed tags. This functionality is not working correctly: the tags of 
> the body of the HTML document are not removed.
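
To make the offset adaptation described above concrete, here is a minimal 
sketch in plain Java. It is not the HTMLAnnotator code and does not use the 
UIMA API; the names TagStripSketch, Span, Result and strip are hypothetical, 
and it assumes well-nested tags with no '>' inside attribute values, no 
comments and no entity decoding.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only: strips tags and records where each element
// ends up in the plain text. Not the TextMarker implementation.
public class TagStripSketch {

    /** One removed element, with offsets into the plain text (not the HTML). */
    public static final class Span {
        public final String tag;
        public final int begin;
        public final int end;

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Plain text plus the spans derived from the removed tags. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    public static Result strip(String html) {
        StringBuilder plain = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<Span> open = new ArrayDeque<>(); // currently open elements
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or a stray '<'): copy it to the plain text.
                plain.append(c);
                i++;
                continue;
            }
            String inner = html.substring(i + 1, close).trim();
            i = close + 1; // skip the tag; nothing is appended to the plain text
            if (inner.startsWith("/")) {
                // Closing tag: the element ends at the current plain-text length.
                String name = inner.substring(1).trim().toLowerCase();
                if (!open.isEmpty() && open.peek().tag.equals(name)) {
                    Span start = open.pop();
                    spans.add(new Span(name, start.begin, plain.length()));
                }
            } else {
                String name = inner.split("[\\s/]+", 2)[0].toLowerCase();
                if (inner.endsWith("/")) {
                    // Self-closing tag: zero-length span at the current position.
                    spans.add(new Span(name, plain.length(), plain.length()));
                } else {
                    // Opening tag: remember where the element starts in the plain text.
                    open.push(new Span(name, plain.length(), -1));
                }
            }
        }
        return new Result(plain.toString(), spans);
    }
}
```

Because every offset is taken from the length of the plain text built so far, 
the spans already refer to the stripped text and need no later correction. In 
this sketch, tags inside the body are removed the same way as any other tag, 
which is the behaviour the issue reports as missing.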

