- Multiple sentences in a line
Oops! I sent a sample from the wrong training data! I tried so many
combinations that I've ended up with several training data files... I do
have one sentence per line in the correct file. In fact, I've got a
function that does it for me.
- Your data is not tokenized
What exactly do you mean? Should the training data be tokenized? I'm not
sure I follow...
It's after the training that I need to tokenize in order to do NER, isn't it?
- Adaptive data is not cleared
Is that necessary, since I'm treating the merged document as a single
document? I mean, I can do it, but will that make any noticeable
difference? I can easily append a newline when cat-ing the papers...
Do you use our command line tools for training?
No, I'm writing my own application that uses the API...
Jim
On 08/02/12 19:09, Jörn Kottmann wrote:
I see the following issues:
- Multiple sentences in a line
- Your data is not tokenized
- Adaptive data is not cleared
You can use our sentence detector to split
your paragraphs. If you know your document
boundaries, you should write an empty line to
the training file at each boundary to clear the
adaptive data. If you cannot do that, write an
empty line after every sentence.
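The layout described above (one sentence per line, an empty line at each document boundary, or after every sentence as a fallback) can be sketched roughly as follows. This is only an illustration in plain Java: the class name `TrainingTextSketch`, the method `format` and the flag `blankAfterEverySentence` are made up for this example and are not part of OpenNLP. It also assumes the sentences have already been split and tokenized, with tokens separated by single spaces.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP: lays out already split and
// tokenized sentences as training text.
class TrainingTextSketch {

    /**
     * Writes one sentence per line. With blankAfterEverySentence == false an
     * empty line is written after each document (the document boundary);
     * with true, an empty line follows every sentence instead.
     */
    static String format(List<List<String>> documents, boolean blankAfterEverySentence) {
        StringBuilder out = new StringBuilder();
        for (List<String> sentences : documents) {
            for (String sentence : sentences) {
                out.append(sentence.trim()).append('\n');
                if (blankAfterEverySentence) {
                    out.append('\n'); // fallback: empty line after each sentence
                }
            }
            if (!blankAfterEverySentence) {
                out.append('\n'); // empty line marks the end of the document
            }
        }
        return out.toString();
    }
}
```

When several papers are merged into one file, passing each paper as its own inner list keeps the boundary lines in place without having to add newlines by hand.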
Do you use our command line tools for training?
Jörn
On 02/08/2012 06:46 PM, Jim - FooBar(); wrote:
Would it be possible for you to show us a sample of your training data?
Maybe one paper.
Absolutely, here you go... a sample has been attached. Let me know if
you want more, but I can assure you that since the sgml tags are
generated automatically (with regex replacement) they are all of the
same format...
Jim
P.S.: fire up your favourite editor, press Ctrl+F and search for
"<START" to locate them easily!
On 08/02/12 17:09, Joern Kottmann wrote:
On Wed, Feb 8, 2012 at 5:56 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
Ah, OK, I see what you mean... but then again, if it recognised it as a
mere token it would not throw "IncompatibleFormat" exceptions but rather
skip it as a token that is not of interest, wouldn't it? I don't have any
patches to send you; I just think that not including spaces in the sgml
tag is a wiser approach... Unless of course you're extracting the sgml
tags via regex... The truth is I've not looked at the source, but I would
expect you to use some sort of xml-ish means to extract the sgml tags. If
your parser is using regex then I'm sure you have your reasons for
including the spaces. But anyway, this is a very small problem for me
because I can indeed sort it manually... My big problem still remains!
The code splits the input string by line and then by white space. Then
the individual parts either match our start and end tags or not.
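To make that description concrete, here is a rough sketch of the approach: split one line on white space, then check each part against a start tag and an end tag. It is an illustration only and is not taken from the OpenNLP source. The class `TagSplitSketch`, the method `spans`, the regular expression and the `"default"` type name are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern below is an assumption, not OpenNLP's own.
class TagSplitSketch {
    private static final Pattern START = Pattern.compile("<START(?::([^>\\s]+))?>");
    private static final String END = "<END>";

    /** Returns entries such as "person[0,2)" for each tagged span in one line. */
    static List<String> spans(String line) {
        List<String> spans = new ArrayList<>();
        int tokenIndex = 0; // counts real tokens only, tags are not tokens
        int start = -1;     // token index where the open span began, or -1
        String type = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // an empty line yields a single empty part
            }
            Matcher m = START.matcher(part);
            if (m.matches()) { // the whole part must be a start tag
                if (start != -1) {
                    throw new IllegalArgumentException("start tag inside an open span");
                }
                start = tokenIndex;
                type = (m.group(1) == null) ? "default" : m.group(1);
            } else if (part.equals(END)) {
                if (start == -1) {
                    throw new IllegalArgumentException("end tag without a start tag");
                }
                spans.add(type + "[" + start + "," + tokenIndex + ")");
                start = -1;
            } else {
                tokenIndex++; // anything else is an ordinary token
            }
        }
        if (start != -1) {
            throw new IllegalArgumentException("start tag without an end tag");
        }
        return spans;
    }
}
```

Because a part must match a tag in full, a tag glued to a word (for example `<START:person>Pierre`) is neither a start tag nor an end tag in this sketch and is counted as an ordinary token, which is why the spaces around the tags matter when splitting this way.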
Anyway, I'll stop bugging you... the fact that you tried to help means a
lot, and certainly if I sort everything out I'll post what the problem
was for future users...
We are also interested in why it does not work for you; we usually use
this kind of experience to improve OpenNLP.
Would it be possible for you to show us a sample of your training data?
Maybe one paper.
Jörn