That is how we defined the training format. The <START> and <END> tags MUST
be whitespace separated,
otherwise it will be recognized as a token.
I think you meant it *WON'T* be recognised as a token... a mere typo, or have I misunderstood? I mean, it's pretty obvious that spaces are necessary from all the exceptions thrown while counting the events! I'm just wondering why you chose to do that in the first place. No spaces makes a lot more sense to me, regardless of the problem I am having with regex replacement...
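
For illustration, here is a minimal sketch of how tags written without spaces could be normalized before training. It is a plain regex pass in Python, not part of OpenNLP; the function name space_out_tags and the sample sentence are made up for this example, and it assumes tags look like <START:type> (or <START>) and <END>.

```python
import re

# Matches an opening tag such as <START:drug> (or a plain <START>) or the closing <END>.
TAG = re.compile(r"<START(?::[^>\s]+)?>|<END>")


def space_out_tags(line: str) -> str:
    """Surround every tag with spaces, then collapse runs of whitespace."""
    spaced = TAG.sub(lambda m: f" {m.group(0)} ", line)
    return " ".join(spaced.split())


print(space_out_tags("Patients received <START:drug>Folic acid<END> daily ."))
```

Running the pass twice gives the same result as running it once, so it should be safe to apply to data that already has the spaces.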

Have a look at our documentation. The NER code you see there is correct.
If you have problems detecting multi-token names, I suspect that something
is wrong with your training data.
I've spent the last 2 weeks reading the docs, and I've read practically all the external material on Stack Overflow and other sources. However, everyone's demo is the same thing shown in the docs, which is the person name finder (which happens to include a multi-word token - Pierre Vinken)! As far as the training data is concerned, I've checked it systematically!!! I had to because, as I said in the beginning, I had nested tags which had to be sorted out manually... I spent a whole day doing that, but at least I was thinking "Finally I am so close to training...!!!". On top of that, if there was something wrong with my training data I would expect exceptions again, but I'm not getting any since I sorted out the nested tags!!!
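
A manual check for nested tags like the one described above could also be automated. The following is only a sketch, assuming one sentence per line and tags of the form <START:type> and <END>; find_tag_problems is a hypothetical helper, not an OpenNLP function, and passing this check does not prove the data is otherwise suitable for training.

```python
import re

# Matches an opening tag such as <START:drug> (or a plain <START>) or the closing <END>.
TAG = re.compile(r"<START(?::[^>\s]+)?>|<END>")


def find_tag_problems(lines):
    """Return (line_number, message) pairs for nested or unbalanced tags."""
    problems = []
    for number, line in enumerate(lines, start=1):
        open_span = False  # True while we are between a <START...> and its <END>
        for match in TAG.finditer(line):
            if match.group(0) == "<END>":
                if not open_span:
                    problems.append((number, "<END> without a matching <START>"))
                open_span = False
            else:
                if open_span:
                    problems.append((number, "nested <START> inside an open span"))
                open_span = True
        if open_span:
            problems.append((number, "<START> never closed on this line"))
    return problems


sample = [
    "<START:drug> Folic acid <END> was given .",
    "<START:drug> Folic <START:drug> acid <END> <END> was given .",
]
for number, message in find_tag_problems(sample):
    print(number, message)
```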
The Name Finder takes one tokenized sentence at a time. After you are done
with a document
you should clear the adaptive data.
In order to avoid doing that, I have merged all 383 papers into a single one with "cat *.txt > merged.txt" and I'm treating it as a single document...
Is that a problem? I don't see how it could be...
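
As an alternative to a plain "cat", the papers could be merged while keeping document boundaries. As far as I understand the OpenNLP name finder training format, an empty line separates documents and triggers the reset of the adaptive feature generators, so the sketch below writes one empty line after each paper. The helper name merge_with_boundaries and the paths in the usage note are hypothetical, and dropping blank lines inside a paper is an assumption made so that they are not read as boundaries.

```python
from pathlib import Path


def merge_with_boundaries(paths, out_path):
    """Concatenate files, writing one empty line after each document."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = Path(path).read_text(encoding="utf-8")
            # Assumption: blank lines inside a paper are dropped so that they
            # are not mistaken for document boundaries.
            sentences = [line.strip() for line in text.splitlines() if line.strip()]
            if not sentences:
                continue  # skip empty files entirely
            out.write("\n".join(sentences))
            out.write("\n\n")  # the empty line marks the end of this document
```

It could be called as merge_with_boundaries(sorted(Path("papers").glob("*.txt")), "merged.train"), where "papers" and "merged.train" are placeholder names. Writing the output under a name that the *.txt pattern does not match keeps a later run from merging the merged file into itself.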

Regards,
Jim




On 08/02/12 15:29, Joern Kottmann wrote:
On Mon, Feb 6, 2012 at 6:15 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

Now the easy and straightforward solution to that is to NOT add spaces in
the tag, like this:

<START:drug>Folic acid<END>    (this will NOT match "acid" in later parsing)

I honestly wasn't expecting that to make any difference to the training
process but as it turns out it breaks it completely.


That is how we defined the training format. The <START> and <END> tags MUST
be whitespace separated,
otherwise it will be recognized as a token.

Jörn

