On 26/11/2007, Tom Loosemore <[EMAIL PROTECTED]> wrote:
> ...you can minimise "false positive" terms by running the copy
> through several different flavours of term extractor, and only using
> terms thrown up by x or more of them (where x depends on your appetite
> for false positives vs false negatives).
>
> So, why not throw the copy through several more term extractors then
> only use the overlapping terms?
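
For concreteness, the "x or more of them" rule quoted above could be sketched roughly like this. It is only an illustration, not anyone's actual implementation: the function name overlapping_terms, the sample term lists, and the lowercasing used to match terms across extractors are all assumptions, and no real extractor API is called.

```python
from collections import Counter


def overlapping_terms(extractor_outputs, min_votes):
    """Return the terms proposed by at least `min_votes` extractors.

    `extractor_outputs` is a list of term lists, one per extractor.
    (Hypothetical helper; matching terms by lowercasing is an assumption.)
    """
    votes = Counter()
    for terms in extractor_outputs:
        # Normalise and de-duplicate so each extractor gets one vote per term.
        votes.update({term.strip().lower() for term in terms})
    return sorted(term for term, count in votes.items() if count >= min_votes)


# Made-up outputs standing in for three different term extractors.
outputs = [
    ["BBC", "term extraction", "metadata"],
    ["bbc", "Metadata", "API"],
    ["BBC", "API", "mailing list"],
]

print(overlapping_terms(outputs, 2))
```

Raising min_votes trades false positives for false negatives: with the sample data, a threshold of 3 keeps only the single term that every extractor agrees on.
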

This should work (and it's been suggested on the backstage-dev list recently). Though I'm uneasy about a possible situation where one of your term extractors comes up with a great set of terms, but the others miss them completely, and so your output is a bad compromise of terms that aren't that meaningful.

Do any APIs let you see the confidence score on their output terms? Having admittedly not thought about this much, it seems to me that a confidence score is key to any realistic combination algorithm.

In terms (sorry) of quality of output, people seem to like Yahoo's API. I've come across Trynt's offering too (http://www.trynt.com/trynt-contextual-term-extraction-api/ ), but ominously their website is giving me a 403 Forbidden error right now. http://www.programmableweb.com/api/clearforest-semantic-web-services1/ has also been suggested on the "pure technical discussion" list.

> - The BBC has at least one *excellent* term extractor in house which
> adds extra metadata like 'this term is a person/place/topic'... would
> be a lovely API to offer, hint hint...

Ah - has this been used to derive the subject categories and contributors for the web version of Infax, by any chance? If so, and even if not, that would be a gorgeous API to offer - please, BBC...

Rhys
-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.
Unofficial list archive: http://www.mail-archive.com/[email protected]/

