> Thanks for the feedback! Muddy Boots is cool...
TheyWorkForYou.com adds links to Hansard by matching Proper Names with Wikipedia entries:
http://www.theyworkforyou.com/debates/?id=2007-11-21a.1190.1

The number of false positives is acceptable, and the Wikipedia links are miles better than the user-generated glossary with which the site was launched. But it's still limited, since it only parses for Capitalised Phrases or ACRONYMS.

Shifting to term extraction seemed an obvious route but, as I think Muddy Boots shows, term extraction tends to throw up an unacceptably large number of 'false positive' terms. These result in crappy random links and are user experience poison.

However, you can minimise 'false positive' terms by running the copy through several different flavours of term extractor, and only using terms thrown up by x or more of them (where x depends on your appetite for false positives vs false negatives). So, why not throw the copy through several more term extractors, then only use the overlapping terms?

The BBC has at least one *excellent* term extractor in house which adds extra metadata like 'this term is a person/place/topic'... would be a lovely API to offer, hint hint...

-
Sent via the backstage.bbc.co.uk discussion group.
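To make the "x or more extractors" idea above concrete, here is a rough Python sketch of the voting step. Everything in it is an assumption for illustration: `consensus_terms`, `min_votes` (the x above) and the three stub extractors are made-up names, and the stubs just return fixed lists rather than calling any real term extraction service.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Hypothetical interface: an extractor takes the copy and returns candidate terms.
Extractor = Callable[[str], Iterable[str]]


def consensus_terms(text: str, extractors: List[Extractor], min_votes: int) -> List[str]:
    """Return the terms proposed by at least `min_votes` of the extractors."""
    votes: Counter = Counter()
    surface = {}  # normalised key -> first spelling seen, used for output
    for extract in extractors:
        keys = set()  # a set, so each extractor gets one vote per term
        for term in extract(text):
            # Collapse whitespace and ignore case when comparing terms.
            key = " ".join(term.split()).casefold()
            if key:
                keys.add(key)
                surface.setdefault(key, term.strip())
        votes.update(keys)
    return sorted(surface[key] for key, n in votes.items() if n >= min_votes)


# Stand-in extractors with fixed output (hypothetical, for illustration only).
def extractor_a(text: str) -> List[str]:
    return ["House of Commons", "NHS", "random links"]


def extractor_b(text: str) -> List[str]:
    return ["house of commons", "NHS"]


def extractor_c(text: str) -> List[str]:
    return ["NHS", "term extraction"]


copy = "some Hansard copy"
extractors = [extractor_a, extractor_b, extractor_c]
agreed = consensus_terms(copy, extractors, min_votes=2)
```

Raising `min_votes` trades false positives for false negatives: with the stubs above, a threshold of 2 keeps "House of Commons" and "NHS", while a threshold of 3 keeps only "NHS".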

