[ 
https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gregory Werner updated OPENNLP-862:
-----------------------------------
    Description: 
BRAT does not require text files to be preprocessed before annotations are
added to them.  This is useful because I can feed documents from the corpora
I am given directly into BRAT.  If I have a line such as:

Residence:   Athens, Georgia

I would create two annotations in BRAT, Athens and Georgia.  BRAT would
generate the offsets and everything would be fine.

In OpenNLP with TokenNameFinderTrainer.brat, however, it appears that only one
entity (Georgia) is processed correctly and the other (Athens) is dropped,
because the comma is not separated from Athens.  I have 789 annotated raw,
non-preprocessed text documents from past efforts.  I believe that the OpenNLP
BRAT format code should be able to handle lines like the one above.

It appears that BratNameSampleStream uses the WhitespaceTokenizer, which is
what produces "Athens," (with the trailing comma) as a single token.
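
The mismatch described above can be reproduced outside OpenNLP.  The following
is a minimal Python sketch, not OpenNLP code: it assumes that an annotation is
usable only when its character offsets start and end exactly on token
boundaries, which is my reading of the behavior and not something confirmed
from the source.  The helper names (whitespace_spans, punctuation_aware_spans,
aligned) are hypothetical, and the punctuation-aware tokenizer is only one
possible remedy, shown for comparison.

```python
import re

def whitespace_spans(text):
    """Character offsets (start, end) of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def punctuation_aware_spans(text):
    """Offsets of tokens when punctuation is split from words (illustrative)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def aligned(annotation, token_spans):
    """True if the annotation starts and ends exactly on token boundaries."""
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    return annotation[0] in starts and annotation[1] in ends

line = "Residence:   Athens, Georgia"

# Offsets as an annotation tool would record them for the two entities.
athens = (line.index("Athens"), line.index("Athens") + len("Athens"))
georgia = (line.index("Georgia"), line.index("Georgia") + len("Georgia"))

ws_tokens = whitespace_spans(line)
print([line[s:e] for s, e in ws_tokens])   # ['Residence:', 'Athens,', 'Georgia']
print(aligned(athens, ws_tokens))          # False: the token ends after the comma
print(aligned(georgia, ws_tokens))         # True

pa_tokens = punctuation_aware_spans(line)
print(aligned(athens, pa_tokens))          # True once the comma is its own token
print(aligned(georgia, pa_tokens))         # True
```

With whitespace-only tokenization the Athens annotation ends one character
before its token does, so only Georgia lines up with a token.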

  was:
BRAT does not require preprocessing of text files in order to add annotations 
to text documents.  And this is great because I can feed documents from corpora 
I am given directly into BRAT.  If I have a line such as:

Residence:   Athens, Georgia

I would provide 2 annotations in BRAT, Athens and Georgia, and BRAT would 
generate the offset and everything would be fine.  

It appears though that I only get 1 entity correctly processed (and the other 
dropped) in OpenNLP with TokenNameFinderTrainer.brat, Georgia, because the 
comma is not separated from Athens.  I have 789 annotated raw, non 
pre-processed text documents from past efforts. I believe that OpenNLP should 
be able to handle lines like the above in the case of the BRAT format code.


> BRAT format packages do not handle punctuation correctly when training NER 
> model
> --------------------------------------------------------------------------------
>
>                 Key: OPENNLP-862
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-862
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Formats
>    Affects Versions: 1.6.0
>            Reporter: Gregory Werner
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
