That is how we defined the training format. The <START> and <END> tags MUST
be white space separated,
otherwise it will be recognized as a token.
I think you meant it *WON'T* be recognised as a token... a mere typo, or
have I misunderstood?
I mean, it's pretty obvious that spaces are necessary from all the
exceptions thrown while counting the events! I'm just wondering why
you chose to do that in the first place. No spaces makes a lot more
sense to me, regardless of the problem I am having with regex replacement...
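For what it's worth, here is a minimal sketch of what seems to be going on, assuming the training lines are split on whitespace before the tags are looked for. `TagSpacing`, `normalize` and `tokens` are hypothetical names made up for this example and are not part of OpenNLP. A tag glued to a word ends up inside an ordinary token, so it is never seen as a tag, and a regex replacement can put the spaces back:

```java
import java.util.Arrays;

class TagSpacing {
    // Ensures there is whitespace on both sides of every <START...> or <END> tag.
    static String normalize(String line) {
        return line.replaceAll("\\s*(<START(?::[^>\\s]+)?>|<END>)\\s*", " $1 ").trim();
    }

    // Splits on whitespace (assumption: this mirrors how training text is tokenized).
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String glued = "<START:drug>Folic acid<END>";
        // Without spaces the tags are fused into the neighbouring words.
        System.out.println(Arrays.toString(tokens(glued)));
        // After normalizing, each tag is a separate whitespace-delimited item.
        System.out.println(Arrays.toString(tokens(normalize(glued))));
    }
}
```

With the glued input the first call yields two tokens (`<START:drug>Folic` and `acid<END>`), while the normalized line yields four, with both tags standing alone.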
Have a look at our documentation. The NER code you see there is correct.
If you have problems detecting multi-token names, I suspect that something
is wrong with your training data.
I've spent the last 2 weeks reading the docs and I've read practically
all the external sources on Stack Overflow and elsewhere. However,
everyone's demo is about the same thing shown in the docs, which is the
person name finder (which happens to include a multi-word token - Pierre
Vinken)! As far as the training data is concerned, I've systematically
checked it!!! I had to because, as I said in the beginning, I had nested
tags which had to be sorted out manually... I spent a whole day doing that,
but at least I was thinking "Finally I am so close to training...!!!".
On top of that, if there was something wrong with my training data I
would expect exceptions again, but I'm not getting any since I sorted
out the nested tags!!!
The Name Finder takes one tokenized sentence at a time. After you are done
with a document
you should clear the adaptive data.
In order to avoid doing that I have merged all 383 papers into a single
one with "cat *.txt -> merged.txt" and I'm treating it as a single
document...
Is that a problem? I don't see how it could be...
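For reference, the per-document reset described above could look roughly like the sketch below. `SentenceFinder` and `DocumentLoop` are hypothetical stand-ins written for this example, not OpenNLP classes; they only mirror the two calls discussed here (finding names in one tokenized sentence, then clearing the adaptive data), and the sentences are assumed to be already tokenized with single spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a name finder; it mirrors only the two calls
// mentioned in this thread, not the real OpenNLP types.
interface SentenceFinder {
    List<String> find(String[] tokens);
    void clearAdaptiveData();
}

class DocumentLoop {
    // Feeds the finder one sentence at a time and resets it after each document.
    static List<String> run(SentenceFinder finder, List<List<String>> documents) {
        List<String> names = new ArrayList<>();
        for (List<String> document : documents) {
            for (String sentence : document) {
                // Assumption: tokens are separated by single spaces.
                names.addAll(finder.find(sentence.split(" ")));
            }
            // Called once per document, not once per sentence.
            finder.clearAdaptiveData();
        }
        return names;
    }
}
```

With everything merged into one file and handled as one document, the reset in this loop would run only once, at the very end.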
Regards,
Jim
On 08/02/12 15:29, Joern Kottmann wrote:
On Mon, Feb 6, 2012 at 6:15 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
Now the easy and straightforward solution to that is to NOT add spaces in
the tags, like this:
<START:drug>Folic acid<END> (this will NOT match "acid" in later parsing)
I honestly wasn't expecting that to make any difference to the training
process but as it turns out it breaks it completely.
That is how we defined the training format. The <START> and <END> tags MUST
be white space separated,
otherwise it will be recognized as a token.
Jörn