[
https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389201#comment-14389201
]
Peter Klügl edited comment on UIMA-4286 at 3/31/15 7:01 PM:
------------------------------------------------------------
That sounds overall very positive :-)
I will take a look at 4) and 6)
(I fixed the javadoc already, but it still needs improvement)
was (Author: pkluegl):
That sounds overall very positive :-)
I will take a look at 4) and 6)
> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
> Key: UIMA-4286
> URL: https://issues.apache.org/jira/browse/UIMA-4286
> Project: UIMA
> Issue Type: Improvement
> Components: ruta
> Affects Versions: 2.2.1ruta
> Reporter: Mario Juric
> Assignee: Peter Klügl
> Fix For: 2.3.0ruta
>
>
> The HTML converter only converts tags that are found inside the body tag.
> Therefore, some information-carrying tags such as citations are left out when
> applying the converter to XML articles with a lot of metadata. It would be useful
> to add the option to have all tags converted since this would allow content
> outside the body to be parsed by natural language analysers as well.
> The converter was originally, as the name implies, conceived for HTML
> documents, but together with the HTML Annotator it can in this way become more
> generally useful by enabling NL parsing of a broader class of documents, such
> as articles stored in XML documents.
> An example of how this option might work can be given by disabling the
> "inBody"-flag inside the HTMLConverterVisitor. The example also illustrates
> what offsets to apply to such annotations but otherwise the document
> annotation offsets can be used. Empty tags can still be ignored but tags with
> only attributes and no content should preferably be converted.
> Experiments with disabling the "in body" constraint reveal that there will
> be an additional need to separate the content of metadata tags in the converted
> text view. An NL parser reading the text will in many cases read different
> tags as one word or one sentence, which is not desirable. Some text delimiter
> should therefore be inserted between tags where required, and it could
> optionally be customizable as well.
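
The two ideas in the description (an optional "in body" constraint and a configurable delimiter between the text of neighbouring tags) can be illustrated with a minimal, self-contained sketch. This is not the Ruta HTMLConverter or HTMLConverterVisitor code: the class TagTextSketch, the method convert and the parameters bodyOnly and delimiter are hypothetical names chosen for illustration. It uses a simple regular expression instead of a real parser, ignores comments and CDATA, and does not track annotation offsets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of UIMA Ruta.
final class TagTextSketch {

    // Matches a tag: group 1 is "/" for closing tags, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /**
     * Strips tags from markup and joins the remaining text segments.
     *
     * @param bodyOnly  if true, keep only text between <body> and </body>
     * @param delimiter text inserted between two non-empty text segments
     */
    static String convert(String markup, boolean bodyOnly, String delimiter) {
        StringBuilder out = new StringBuilder();
        // With the constraint disabled, all text counts as convertible.
        boolean inBody = !bodyOnly;
        int last = 0;
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            if (inBody) {
                append(out, markup.substring(last, m.start()), delimiter);
            }
            if (bodyOnly && m.group(2).equalsIgnoreCase("body")) {
                // An opening <body> switches the flag on, </body> switches it off.
                inBody = m.group(1).isEmpty();
            }
            last = m.end();
        }
        if (inBody) {
            append(out, markup.substring(last), delimiter);
        }
        return out.toString();
    }

    // Appends a trimmed, non-empty segment, preceded by the delimiter if needed.
    private static void append(StringBuilder out, String text, String delimiter) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        if (out.length() > 0) {
            out.append(delimiter);
        }
        out.append(trimmed);
    }
}
```

For a document with a title "Ruta" and an author "Juric" in a metadata section followed by a body paragraph, calling convert with bodyOnly set to false and an empty delimiter fuses the two metadata values into "RutaJuric", which is the problem described above. Passing a delimiter such as " | " keeps them apart. Note that this sketch also separates text around inline tags inside the body, which a real implementation would probably want to restrict to selected tags.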