Hi,

+1 for doing the baseline first. I was also thinking of setting a priority for each type, for example giving a higher priority to types which have a higher F1. But I also like Jörn's suggestion of using the probabilities from the model.
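
A rough sketch of that baseline, with the per-type priority as an optional tie-breaker, could look like the following. It is only an illustration under a few assumptions: it is written in Python for brevity and does not use any OpenNLP classes, `Span`, `resolve_overlaps` and `type_priority` are made-up names, spans are token offsets with an exclusive end, and the priority values (e.g. per-type F1) would have to come from a separate evaluation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a name span: token offsets, end exclusive."""
    start: int
    end: int
    type: str

    def length(self) -> int:
        return self.end - self.start

    def overlaps(self, other: "Span") -> bool:
        # True for identical, intersecting and contained spans alike.
        return self.start < other.end and other.start < self.end


def resolve_overlaps(spans: Iterable[Span],
                     type_priority: Optional[Dict[str, float]] = None) -> List[Span]:
    """Greedy baseline: rank all spans, then keep each span that does not
    overlap a span that was already kept.

    Longer spans rank first. type_priority (assumed here to be something
    like per-type F1) only breaks ties between spans of equal length;
    without it, ties fall back to start offset and type name.
    """
    priority = type_priority or {}
    ranked = sorted(
        set(spans),  # drops exact duplicates (same offsets and same type)
        key=lambda s: (-s.length(), -priority.get(s.type, 0.0), s.start, s.type),
    )
    kept: List[Span] = []
    for candidate in ranked:
        if not any(candidate.overlaps(k) for k in kept):
            kept.append(candidate)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.start)
```

For instance, given `Span(0, 4, "A")`, `Span(2, 3, "B")`, `Span(5, 7, "person")` and `Span(5, 7, "location")` with made-up priorities `{"person": 0.85, "location": 0.70}`, this keeps `A` (the longer of the two nested spans) and `person` (the higher-priority type for the identical span). Because the ranking is a single sort key, swapping in model probabilities later would only mean changing that key.
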

William

On Tue, Apr 17, 2012 at 10:03 AM, Jörn Kottmann <[email protected]> wrote:

> I propose that we make a simple baseline implementation
> which takes all output spans, orders them and then resolves
> the ambiguities based on the order. This will prefer longer
> names over shorter names, but ignores the type.
>
> There are more sophisticated ways of handling this,
> e.g. taking probabilities from the statistical name finders into
> account, but these might be a bit more restrictive as well.
>
> It's always good to have some simple baseline, to see how much
> something more complicated improves it.
>
> Any opinions?
>
> Jörn
>
>
> On 04/17/2012 02:52 PM, Jörn Kottmann wrote:
>
>> If you don't want to handle these cases, you can simply copy all names
>> together into a list, and then do evaluation on this list.
>> This approach works with our evaluation, but will usually be an issue for
>> applications which expect output where the ambiguities mentioned earlier
>> are resolved.
>>
>> Jörn
>>
>> On 04/17/2012 02:38 PM, Jim - FooBar(); wrote:
>>
>>> Ok first of all you're referring to the final merging
>>> (AggregateNameFinder) and not the multiple dictionaries where no merging
>>> occurs...anyway let's deal with this at the moment. let's see...
>>>
>>>> - Two names can be identical and have the same type or a different type
>>>>
>>> Well if the type is different the spans are not identical (equal) so
>>> you keep both and do some reasoning over them (see below).
>>> If the type is the same and the spans cover the same text then they are
>>> equal so you only keep one of them.
>>>
>>>> - Two names have intersecting spans
>>>>
>>> It is very unlikely that both are correct so in the simplest case of
>>> keeping them both you may lose some precision. However considering how
>>> often that could happen it becomes unimportant. Or you could do some
>>> reasoning (see below) again if they have the same type. If they don't have
>>> the same type then why not keep them both again?
>>>
>>>> - One name is contained in another like this:
>>>> <START:A> a b<START:B> c<END:B> d<END:A>
>>>>
>>> well, this is exactly the same case as before conceptually. If they have
>>> the same type it's very likely that one is wrong. You can do the same sort
>>> of reasoning as above. If they don't there is no way to know with
>>> confidence what to do so i say keep them both.
>>>
>>> the reasoning i'm referring to is simply to *trust the dictionary* (if
>>> one exists). If one doesn't exist and one is trying to merge results from
>>> several maxent models for example, then we cannot make an informed
>>> decision. It is only the dictionary that can provide facts. all the rest
>>> are probabilities...
>>>
>>> Jim
>>>
>>>
>>>> Hi all,
>>>>
>>>> in one of the jiras we started a discussion about merging the output
>>>> of multiple name finders and which conflicts exist.
>>>> Let's move it back to the dev list.
>>>>
>>>> The merging code needs to handle these cases:
>>>>
>>>> - Two names can be identical and have the same type or a different type.
>>>>
>>>> - Two names have intersecting spans like this:
>>>> <START:A> a b<START:B> c<END:A> d<END:B>
>>>>
>>>> - One name is contained in another like this:
>>>> <START:A> a b<START:B> c<END:B> d<END:A>
>>>>
>>>> Depending on the use case and merging logic it might be resolved
>>>> differently.
>>>>
>>>> Jörn
