Re: Problem with openNLP Name Finder API....

Jim - FooBar(); Wed, 08 Feb 2012 11:22:02 -0800

- Multiple sentences in a line

Oops!!! I sent a sample from the wrong training data!!! I tried so many combinations that I've ended up with several training data files... I do have one sentence per line in the correct file. In fact I've got a function that does it for me.

- Your data is not tokenized

What exactly do you mean? Should the training data be tokenized? I'm not sure I follow...

It's after the training that I need to tokenize in order to do NER, isn't it?

- Adaptive data is not cleared

Is that necessary since I'm treating the merged document as a single document? I mean I can do it, but will that make any noticeable difference? I can easily append a newline when cat-ing the papers...

Do you use our command line tools for training?


No, I'm writing my own application that uses the API...

Jim


On 08/02/12 19:09, Jörn Kottmann wrote:

I see the following issues:
- Multiple sentences in a line
- Your data is not tokenized
- Adaptive data is not cleared

You can use our sentence detector to split
your paragraphs. If you know your document
boundaries, you should write an empty line to
that file to clear the adaptive data. If you cannot
do that, write an empty line after every sentence.
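
As a rough illustration of the layout described above (one sentence per line, tokens separated by white space, an empty line at each document boundary), here is a minimal Java sketch. The class TrainingDataWriter, its format method, the "person" entity type and the sample sentences are all hypothetical and not part of OpenNLP. The sketch only builds the text and assumes that sentence splitting and tokenization have already been done.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP: lays out name finder training text.
final class TrainingDataWriter {

    // Each document is a list of sentences; each sentence is a list of tokens.
    // Annotation tags such as <START:person> and <END> are passed in as tokens too.
    static String format(List<List<List<String>>> documents) {
        StringBuilder out = new StringBuilder();
        for (List<List<String>> document : documents) {
            for (List<String> sentence : document) {
                // One sentence per line, tokens separated by single spaces.
                out.append(String.join(" ", sentence)).append('\n');
            }
            // An empty line marks the document boundary.
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: two "papers", the first with two sentences.
        List<List<List<String>>> docs = List.of(
            List.of(
                List.of("<START:person>", "Pierre", "Vinken", "<END>", ",", "61", "years", "old", "."),
                List.of("He", "joined", "the", "board", ".")),
            List.of(
                List.of("A", "second", "paper", "starts", "here", ".")));
        System.out.print(format(docs));
    }
}
```

If the document boundaries are not known, the same idea applies with every sentence treated as its own document, so that an empty line follows each sentence.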

Do you use our command line tools for training?

Jörn

On 02/08/2012 06:46 PM, Jim - FooBar(); wrote:
Would it be possible for you to show us a sample of your training data?
Maybe one paper.
Absolutely, here you go... a sample has been attached... Let me know if you want more, but I can assure you that since the SGML tags are generated automatically (with regex replacement) they are all of the same format...
Jim
p.s: fire up your favourite editor, press ctrl+f and search for "<START" to locate them easily!
On 08/02/12 17:09, Joern Kottmann wrote:
On Wed, Feb 8, 2012 at 5:56 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
Aaa OK, I see what you mean... but then again, if it recognised it as a mere token it would not throw "IncompatibleFormat" exceptions but rather skip it as a token that is not of interest, wouldn't it? I don't have any patches to send you, I just think that not including spaces in the SGML tag is a wiser approach... Unless of course you're extracting the SGML tags via regex... The truth is I've not looked at the source, but I would expect you to use some sort of XML-ish means to extract the SGML tags. If your parser is using regex then I'm sure you have your reasons for including the spaces. But anyway, this is a very small problem for me 'cos I can indeed sort it manually... My big problem still remains!!!
The code splits the input string by line and then by white space. Then the individual parts either match our start and end tags or not.
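
To make the "split by line, then by white space" idea concrete, here is a small Java sketch. It is not OpenNLP's actual parser: the class TagSplitSketch, the spans method and the exact tag patterns are assumptions made purely for illustration. It shows why a tag only counts as a tag when white space surrounds it, since each whitespace-separated part is compared against the tag shapes as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not OpenNLP's parser, just the idea described above.
final class TagSplitSketch {

    // Assumed tag shapes: <START:type> opens a span, <END> closes it.
    private static final Pattern START = Pattern.compile("<START:([^>\\s]+)>");
    private static final String END = "<END>";

    // Returns one "type=token token" entry for every annotated span in the text.
    static List<String> spans(String text) {
        List<String> found = new ArrayList<>();
        for (String line : text.split("\\R")) {               // first split by line
            String type = null;
            List<String> current = new ArrayList<>();
            for (String part : line.trim().split("\\s+")) {   // then by white space
                Matcher m = START.matcher(part);
                if (m.matches()) {                            // the whole part must be a start tag
                    if (type != null) {
                        throw new IllegalArgumentException("nested start tag: " + part);
                    }
                    type = m.group(1);
                } else if (part.equals(END)) {
                    if (type == null) {
                        throw new IllegalArgumentException("end tag without start tag");
                    }
                    found.add(type + "=" + String.join(" ", current));
                    type = null;
                    current.clear();
                } else if (type != null) {
                    current.add(part);                        // token inside an open span
                }
            }
            if (type != null) {
                throw new IllegalArgumentException("span not closed on line: " + line);
            }
        }
        return found;
    }
}
```

In this sketch a part such as "<START:person>Pierre" matches neither tag shape, so it is treated as an ordinary token and no span is found. How the real parser reacts to such input is not shown here.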
Anyway, I'll stop bugging you... the fact that you tried to help means a lot, and certainly if I sort everything out I'll post what the problem was for future users...
We are also interested in why it does not work for you; we usually use this kind of experience to improve OpenNLP.

Would it be possible for you to show us a sample of your training data?
Maybe one paper.

Jörn
