[ https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joern Kottmann updated OPENNLP-862:
-----------------------------------
Fix Version/s: 1.7.1
> BRAT format packages do not handle punctuation correctly when training NER
> model
> --------------------------------------------------------------------------------
>
> Key: OPENNLP-862
> URL: https://issues.apache.org/jira/browse/OPENNLP-862
> Project: OpenNLP
> Issue Type: Improvement
> Components: Formats
> Affects Versions: 1.6.0
> Reporter: Gregory Werner
> Priority: Minor
> Fix For: 1.7.1
>
>
> BRAT does not require text files to be preprocessed before annotations are
> added to them. This is great because I can feed documents from the corpora
> I am given directly into BRAT. If I have a line such as:
> Residence: Athens, Georgia
> I would provide 2 annotations in BRAT, Athens and Georgia, and BRAT would
> generate the offsets and everything would be fine.
> It appears, though, that with TokenNameFinderTrainer.brat in OpenNLP only 1
> entity, Georgia, is processed correctly. The other, Athens, is dropped because
> the comma is not separated from it. I have 789 annotated raw, non
> pre-processed text documents from past efforts. I believe that the BRAT format
> code in OpenNLP should be able to handle lines like the one above.
> It appears that BratNameSampleStream uses the WhitespaceTokenizer, and that is
> what produces "Athens," (with the trailing comma) as a single token. From my
> limited testing on raw documents, the SimpleTokenizer might perform better
> with BRAT if the current general approach is kept.
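
The mismatch described above can be illustrated with a small standalone sketch. It does not use OpenNLP's classes: `TokenAlignmentSketch`, `whitespaceSpans`, `charClassSpans`, and `aligns` are hypothetical names, and the character-class splitter is only loosely modeled on what SimpleTokenizer is described as doing. The assumption, taken from the report, is that an annotation is usable only when its character offsets coincide with token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helpers, not OpenNLP's tokenizer implementations.
public class TokenAlignmentSketch {

    // Splits on whitespace only; returns [start, end) character offsets per token.
    static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new int[] {start, i});
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else (punctuation etc.).
    static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Starts a new token whenever the character class changes, so "Athens,"
    // becomes two tokens: "Athens" and ",".
    static List<int[]> charClassSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        int prev = 0;
        for (int i = 0; i < text.length(); i++) {
            int cls = charClass(text.charAt(i));
            if (start >= 0 && cls != prev) {
                spans.add(new int[] {start, i});
                start = -1;
            }
            if (cls != 0 && start < 0) {
                start = i;
            }
            prev = cls;
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    // An annotation [begin, end) aligns if it starts at a token start
    // and ends at a token end.
    static boolean aligns(List<int[]> tokens, int begin, int end) {
        boolean startOk = false;
        boolean endOk = false;
        for (int[] t : tokens) {
            if (t[0] == begin) startOk = true;
            if (t[1] == end) endOk = true;
        }
        return startOk && endOk;
    }

    public static void main(String[] args) {
        String text = "Residence: Athens, Georgia";
        // BRAT-style character offsets: Athens = [11, 17), Georgia = [19, 26).
        System.out.println("whitespace, Athens:  " + aligns(whitespaceSpans(text), 11, 17));
        System.out.println("whitespace, Georgia: " + aligns(whitespaceSpans(text), 19, 26));
        System.out.println("char-class, Athens:  " + aligns(charClassSpans(text), 11, 17));
        System.out.println("char-class, Georgia: " + aligns(charClassSpans(text), 19, 26));
    }
}
```

With whitespace-only splitting, the token covering Athens ends after the comma, so the annotation's end offset matches no token boundary, while Georgia aligns. Splitting on character-class changes puts the comma in its own token, and both annotations align.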