On Wed, Feb 8, 2012 at 4:45 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
>> That is how we defined the training format. The <START> and <END> tags
>> MUST be white space separated, otherwise it will be recognized as a token.
>
> I think you meant it *WON'T* be recognised as a token... a mere typo, or
> have I misunderstood? I mean, it's pretty obvious that spaces are necessary
> from all the exceptions thrown while counting the events! I'm just
> wondering why you have chosen to do that in the first place? No spaces
> makes a lot more sense to me, regardless of the problem I am having with
> regex replacement...

The parsing code for the format expects white-space-tokenized text. The
<START> and <END> tags are handled differently and are not tokens in this
sense, but when you attach one directly to a word, as you did with
acid<START>, our parsing code recognizes it as a plain token and not as the
tag that marks entity boundaries.

>> Have a look at our documentation. The NER code you see there is correct.
>> If you have problems detecting multi-token names, I suspect that
>> something is wrong with your training data.
>
> I've spent the last two weeks reading the docs, and I've practically read
> all the external sources on Stack Overflow and elsewhere. However,
> everyone's demo is about the same thing shown in the docs, which is the
> person name finder (which happens to include a multi-word token - Pierre
> Vinken)! As far as the training data is concerned, I've checked it
> systematically! I had to, because, as I said in the beginning, I had
> nested tags which had to be sorted out manually... I spent a whole day
> doing that, but at least I was thinking, "Finally, I am so close to
> training...!!!" On top of that, if there were something wrong with my
> training data, I would expect exceptions again, but I'm not getting any
> since I sorted out the nested tags!

You are welcome to send us patches for problems in our training data parsing
code; to my knowledge it just works as long as the data is in the correct
format.
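As a side note, the regex replacement Jim mentions can be sketched in a few
lines of Python. This is a minimal, hypothetical helper (not part of OpenNLP)
that pads OpenNLP 1.5-style <START:type>/<END> tags with spaces so the
whitespace-tokenizing parser sees them as tags rather than as part of the
neighbouring word:

```python
import re

# Match a <START>, <START:type>, or <END> tag together with any
# surrounding whitespace, so already-spaced tags are left unchanged.
TAG = re.compile(r'\s*(<START(?::[^>\s]+)?>|<END>)\s*')

def pad_tags(line: str) -> str:
    """Surround every tag with single spaces, then collapse space runs.

    E.g. "acid<START:chem>sulfuric" becomes "acid <START:chem> sulfuric",
    which is the whitespace-separated form the training parser expects.
    """
    padded = TAG.sub(r' \1 ', line)
    return re.sub(r'\s{2,}', ' ', padded).strip()

print(pad_tags("the acid<START:chem>sulfuric acid<END>was added"))
# -> "the acid <START:chem> sulfuric acid <END> was added"
```

Running the helper over each training line before feeding it to the converter
would avoid the "tag glued to a word" problem described above.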
Format violations might be hard to find, that is true; we have already
improved this a bit (I think that is already in 1.5.2).

>> The Name Finder takes a tokenized sentence at a time. After you are done
>> with a document you should clear the adaptive data.
>
> In order to avoid doing that, I have merged all 383 papers into a single
> one with "cat *.txt -> merged.txt" and I'm treating it as a single
> document... Is that a problem? I don't see how it could be...

Yes, it costs you a little recall and precision, because the previous-map
feature does not work this way. Empty lines are used in the training data to
indicate document boundaries, and you need to put each sentence on its own
line: a line break is used to indicate a sentence boundary.

Anyway, I still believe something is wrong with your training data. Would it
be possible to have a look at one of these papers? Or a few sentences? Do
you pass a sentence at a time to the name finder?

Jörn
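To illustrate the layout described above (one sentence per line, an empty
line between documents), here is a minimal Python sketch of how the 383
papers could be merged instead of a plain `cat`. The helper name
`merge_documents` is hypothetical, not part of OpenNLP, and it assumes each
input file already has one sentence per line:

```python
from pathlib import Path

def merge_documents(paths, out_path):
    """Concatenate training files, keeping one document boundary per file.

    OpenNLP's name finder training format treats an empty line as a
    document boundary and each non-empty line as one sentence, so each
    input file is joined with a single blank line between files.
    """
    docs = []
    for p in paths:
        # Drop stray blank lines inside a file so the only empty lines
        # in the output are the document boundaries we add between files.
        lines = [l.strip() for l in Path(p).read_text(encoding="utf-8").splitlines()]
        docs.append("\n".join(l for l in lines if l))
    Path(out_path).write_text("\n\n".join(docs) + "\n", encoding="utf-8")
```

Merging this way keeps the previous-map feature effective, because each
paper is still seen as its own document during training.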