[jira] [Comment Edited] (UIMA-4286) Ruta: HTMLConverter: Option to convert tags outside body tags

JIRA Tue, 31 Mar 2015 12:02:12 -0700

    [ 
https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389201#comment-14389201
 ]


Peter Klügl edited comment on UIMA-4286 at 3/31/15 7:01 PM:
------------------------------------------------------------

That sounds overall very positive :-)

I will take a look at 4) and 6)

(I fixed the javadoc already, but it still needs improvement)


was (Author: pkluegl):
That sounds overall very positive :-)

I will take a look at 4) and 6)

> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
>                 Key: UIMA-4286
>                 URL: https://issues.apache.org/jira/browse/UIMA-4286
>             Project: UIMA
>          Issue Type: Improvement
>          Components: ruta
>    Affects Versions: 2.2.1ruta
>            Reporter: Mario Juric
>            Assignee: Peter Klügl
>             Fix For: 2.3.0ruta
>
>
> The HTML converter only converts tags that are found inside the body tag. 
> As a result, information-carrying tags such as citations are left out when 
> the converter is applied to XML articles with a lot of metadata. It would be 
> useful to add an option to convert all tags, since this would allow content 
> outside the body to be parsed by natural language analysers as well.
> As the name implies, the converter was originally conceived for HTML 
> documents, but with this option, and together with the HTML Annotator, it 
> becomes more generally useful for NL parsing of a broader class of documents, 
> such as articles stored as XML.
> Disabling the "inBody" flag inside the HTMLConverterVisitor gives an example 
> of how this option might work. The example also illustrates what offsets to 
> apply to such annotations; otherwise the document annotation offsets can be 
> used. Empty tags can still be ignored, but tags with only attributes and no 
> content should preferably be converted.
> Experiments with disabling the "in body" constraint reveal that there will 
> be an additional need to separate the content of metadata tags in the 
> converted text view. An NL parser reading the text will in many cases read 
> different tags as one word or one sentence, which is not desirable. Some text 
> delimiter should therefore be inserted between tags where required, and the 
> delimiter could optionally be customizable as well.
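
A minimal sketch of the delimiter idea from the description above. It does not
use the Ruta HTMLConverter or its HTMLConverterVisitor; it is a standalone
illustration built on the JDK's SAX parser, and the class name
AllTagsTextSketch, the method convert and the sample XML are all hypothetical.
It collects the text of every element (not only the body) and inserts a
configurable delimiter wherever a tag boundary separates two pieces of text.
It does not track offsets, which the real converter would need to do.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AllTagsTextSketch {

    /**
     * Returns the text content of all elements in the given XML, with the
     * delimiter inserted between text runs that are separated by a tag.
     * Hypothetical helper, not part of UIMA Ruta.
     */
    static String convert(String xml, final String delimiter) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Text seen since the last tag boundary; SAX may deliver it in chunks.
            private final StringBuilder pending = new StringBuilder();

            // Emit the buffered text, preceded by the delimiter if needed.
            private void flush() {
                String text = pending.toString().trim();
                pending.setLength(0);
                if (text.isEmpty()) {
                    return; // empty or whitespace-only content is ignored
                }
                if (out.length() > 0) {
                    out.append(delimiter);
                }
                out.append(text);
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                flush(); // every opening tag is a boundary, inside or outside body
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                flush(); // every closing tag is a boundary as well
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                pending.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><head><citation>Smith 2014</citation>"
                + "<title>Parsing</title></head>"
                + "<body><p>Text.</p></body></article>";
        // Without a delimiter the citation and title would run together
        // as "Smith 2014Parsing".
        System.out.println(convert(xml, " | "));
    }
}
```

Note that this sketch treats every tag as a boundary, so inline markup inside a
sentence would also receive a delimiter; a real option would probably need a
way to restrict which tags trigger one.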



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
