[
https://issues.apache.org/jira/browse/OPENNLP-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920788#comment-16920788
]
ASF GitHub Bot commented on OPENNLP-565:
----------------------------------------
zameji commented on issue #364: OPENNLP-565 Support for the MASC format
URL: https://github.com/apache/opennlp/pull/364#issuecomment-527113308
The proposed change lets OpenNLP read the MASC corpus as currently
distributed, in its stand-off annotation format. Supported annotations are:
- Sentences
- Tokens
- POS tags
- Named entities
The module was tested against the whole MASC corpus. Adaptations were made
to handle errors in the corpus and incompatibilities between OpenNLP and
MASC, such as named entity overlaps, sentence overlaps, and zero-token
sentences.
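
To illustrate the kind of adaptation meant by "named entity overlaps", here
is a minimal sketch, not the code from the pull request. It assumes that a
stand-off annotation can be reduced to a pair of character offsets into the
raw text plus a label. The Span class, the dropOverlaps method and the
"earliest start wins, longest on ties" policy are all hypothetical choices
made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class StandoffOverlapSketch {

    /** Hypothetical stand-off annotation: offsets [start, end) into the raw text. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        int length() {
            return end - start;
        }
    }

    /**
     * Returns a non-overlapping subset of the given spans.
     * Assumed policy: sort by start offset, prefer the longer span when two
     * spans start at the same offset, then keep a span only if it begins at
     * or after the end of the last span kept. Empty spans are dropped.
     */
    static List<Span> dropOverlaps(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> a.start != b.start
                ? Integer.compare(a.start, b.start)
                : Integer.compare(b.length(), a.length()));

        List<Span> kept = new ArrayList<>();
        int lastEnd = Integer.MIN_VALUE;
        for (Span s : sorted) {
            if (s.length() <= 0) {
                continue; // skip empty or malformed spans
            }
            if (s.start >= lastEnd) {
                kept.add(s);
                lastEnd = s.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "New York Times reporter John Smith";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 8, "location"));      // "New York"
        spans.add(new Span(0, 14, "organization")); // "New York Times"
        spans.add(new Span(24, 34, "person"));      // "John Smith"

        for (Span s : dropOverlaps(spans)) {
            System.out.println(s.type + ": " + text.substring(s.start, s.end));
        }
    }
}
```

A converter could apply a filter like this before building OpenNLP name
samples, since two entities covering the same token cannot both be encoded
in a single tag sequence. Other policies, such as keeping the innermost
entity, would be equally valid.
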
MASC is distributed under CC-BY 3.0, which is an ASF Category B license, so
the corpus itself is not bundled. A set of fake files is included for testing
instead.
> Add MASC format support
> -----------------------
>
> Key: OPENNLP-565
> URL: https://issues.apache.org/jira/browse/OPENNLP-565
> Project: OpenNLP
> Issue Type: Improvement
> Components: Formats
> Reporter: Joern Kottmann
> Priority: Minor
> Labels: help-wanted
>
> Add format support for the MASC corpus. The corpus contains annotations for
> most of the components in OpenNLP and would be a great source of freely
> available training data for testing.
> The corpus can be found here:
> http://www.anc.org/MASC/About.html#format