[jira] [Commented] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model

Gregory Werner (JIRA) Mon, 10 Oct 2016 06:11:28 -0700

    [ 
https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562263#comment-15562263
 ]


Gregory Werner commented on OPENNLP-862:
----------------------------------------

I have found that the arg {{-ruleBasedTokenizer simple}} is really what I need 
instead of altering the source code and this is how to address the "raw text" 
case.  I was just surprised that the Whitespace Tokenizer was the default 
choice if nothing in explicitly stated.

I don't know if there really is anything else to be added, I think the cases 
are pretty well covered.

> BRAT format packages do not handle punctuation correctly when training NER 
> model
> --------------------------------------------------------------------------------
>
>                 Key: OPENNLP-862
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-862
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Formats
>    Affects Versions: 1.6.0
>            Reporter: Gregory Werner
>
> BRAT does not require preprocessing of text files in order to add annotations 
> to text documents.  And this is great because I can feed documents from 
> corpora I am given directly into BRAT.  If I have a line such as:
> Residence:   Athens, Georgia
> I would provide 2 annotations in BRAT, Athens and Georgia, and BRAT would 
> generate the offset and everything would be fine.  
> It appears though that I only get 1 entity correctly processed (and the other 
> dropped) in OpenNLP with TokenNameFinderTrainer.brat, Georgia, because the 
> comma is not separated from Athens.  I have 789 annotated raw, non 
> pre-processed text documents from past efforts. I believe that OpenNLP should 
> be able to handle lines like the above in the case of the BRAT format code.
> It appears that BratNameSampleStream uses the WhitespaceTokenizer and that is 
> what creates Athens, as a token.  I find that the SimpleTokenizer might 
> perform better with BRAT through my limited testing of raw documents if the 
> current general approach is held.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model

Reply via email to