- Multiple sentences in a line
Oops! I sent a sample from the wrong training data. I tried so many combinations that I've ended up with several training data files... I do have one sentence per line in the correct file. In fact, I've got a function that does it for me.

- Your data is not tokenized

What exactly do you mean? Should the training data be tokenized? I'm not sure I follow...
It's after the training that I need to tokenize in order to do NER, isn't it?

- Adaptive data is not cleared

Is that necessary, since I'm treating the merged document as a single document? I mean, I can do it, but will that make any noticeable difference? I can easily append a newline when cat-ing the papers...

Do you use our command line tools for training?

No, I'm writing my own application that uses the API...

Jim


On 08/02/12 19:09, Jörn Kottmann wrote:
I see the following issues:
- Multiple sentences in a line
- Your data is not tokenized
- Adaptive data is not cleared

You can use our sentence detector to split
your paragraphs. If you know your document
boundaries, you should write an empty line to
that file at each boundary to clear the adaptive
data. If you cannot do that, write an empty line
after every sentence.
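
The layout described above could be produced along these lines. This is only a minimal sketch in plain Java, not OpenNLP API: the class name TrainingFileLayout and the method layout are hypothetical, and it assumes each paper has already been split into sentences (for example with the sentence detector). It writes one sentence per line and an empty line after each document.

```java
import java.util.List;

class TrainingFileLayout {

    /**
     * Hypothetical helper: joins documents into one training text with
     * one sentence per line and an empty line after every document.
     */
    static String layout(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (List<String> sentences : documents) {
            for (String sentence : sentences) {
                // One sentence per line; inner whitespace is left untouched.
                out.append(sentence.trim()).append('\n');
            }
            // The empty line marks the document boundary.
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> papers = List.of(
                List.of("First sentence of paper one .", "Second sentence of paper one ."),
                List.of("Only sentence of paper two ."));
        System.out.print(layout(papers));
    }
}
```

If the document boundaries are unknown, the same helper can be called with every sentence wrapped in its own single-element list, which gives an empty line after every sentence.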

Do you use our command line tools for training?

Jörn

On 02/08/2012 06:46 PM, Jim - FooBar(); wrote:
Would it be possible for you to show us a sample of your training data?
Maybe one paper.

Absolutely, here you go... a sample has been attached. Let me know if you want more, but I can assure you that since the SGML tags are generated automatically (with regex replacement), they are all of the same format...

Jim

P.S.: fire up your favourite editor, press Ctrl+F and search for "<START" to locate them easily!


On 08/02/12 17:09, Joern Kottmann wrote:
On Wed, Feb 8, 2012 at 5:56 PM, Jim - FooBar();<jimpil1...@gmail.com>wrote:

Ah, OK, I see what you mean... but then again, if it recognised it as a mere token, it would not throw "IncompatibleFormat" exceptions but rather skip it as a token that is not of interest, wouldn't it? I don't have any patches to send you; I just think that not including spaces in the SGML tag is a wiser approach... unless of course you're extracting the SGML tags via regex. The truth is I've not looked at the source, but I would expect you to use some sort of XML-ish means to extract the SGML tags. If your parser is using regex, then I'm sure you have your reasons for including the spaces. But anyway, this is a very small problem for me because I can indeed sort it manually... My big problem still remains!

The code splits the input string by line and then by white space. Then the
individual parts either match our start and end tags or not.
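
To illustrate why the spaces around the tags matter with that approach, here is a simplified sketch in plain Java. It is not the actual OpenNLP source: the class TagSplitSketch, the method spans and the exact tag pattern are assumptions made for illustration. Each line is split on white space, and a part only counts as a tag if the whole part matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSplitSketch {

    // Assumed tag shapes: <START>, <START:type> and <END>.
    private static final Pattern START = Pattern.compile("<START(:([^:>\\s]*))?>");
    private static final String END = "<END>";

    /** Returns one "type[begin,end)" entry per tagged span, in token indexes. */
    static List<String> spans(String line) {
        List<String> found = new ArrayList<>();
        int tokenIndex = 0;   // counts only real tokens, not tags
        int spanStart = -1;   // -1 means no span is currently open
        String type = null;
        for (String part : line.trim().split("\\s+")) {
            Matcher m = START.matcher(part);
            if (m.matches()) {
                if (spanStart != -1) {
                    throw new IllegalArgumentException("start tag inside an open span: " + part);
                }
                spanStart = tokenIndex;
                type = m.group(2) == null ? "default" : m.group(2);
            } else if (part.equals(END)) {
                if (spanStart == -1) {
                    throw new IllegalArgumentException("end tag without a start tag");
                }
                found.add(type + "[" + spanStart + "," + tokenIndex + ")");
                spanStart = -1;
            } else {
                tokenIndex++; // anything else is an ordinary token
            }
        }
        if (spanStart != -1) {
            throw new IllegalArgumentException("start tag without an end tag");
        }
        return found;
    }
}
```

In this sketch, a tag glued to a word, as in "<START:person>Jim<END>", never matches as a whole part, so it is treated as one ordinary token and no span is found. With spaces, as in "<START:person> Jim <END>", the span is picked up.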



Anyway, I'll stop bugging you... the fact that you tried to help means a lot, and certainly if I sort everything out I'll post what the problem was for future users...


We are also interested in why it does not work for you; we usually use this
kind of experience to improve OpenNLP.

Would it be possible for you to show us a sample of your training data?
Maybe one paper.

Jörn



