> Thanks for the feedback !

Muddy Boots is cool...

TheyWorkForYou.com adds links to Hansard by matching Proper Names with
Wikipedia entries.
http://www.theyworkforyou.com/debates/?id=2007-11-21a.1190.1

The number of false positives is acceptable and the Wikipedia links are
miles better than the user-generated glossary with which the site was
launched. But it's still limited, since it only parses for Capitalised
Phrases or ACRONYMS.

Shifting to term extraction seemed an obvious route, but as I think
Muddy Boots shows, term extraction tends to throw up an unacceptably
large number of 'false positive' terms. These result in crappy random
links and are user experience poison.

However, you can minimise "false positive" terms by running the copy
through several different flavours of term extractor, and only using
terms thrown up by x or more of them (where x depends on your appetite
for false positives vs false negatives).

So, why not throw the copy through several more term extractors and
then only use the overlapping terms?
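
Here's a rough sketch of that voting idea in Python. It assumes each
term extractor can be wrapped as a plain function that takes the copy
and returns a list of terms; the function names and the three stand-in
extractors are made up for illustration and aren't any real extractor's
API.

```python
from collections import Counter


def consensus_terms(text, extractors, x):
    """Return the terms thrown up by x or more of the extractors."""
    votes = Counter()
    for extract in extractors:
        # Lower-case and de-duplicate, so each extractor gets at most
        # one vote per term.
        votes.update({term.lower() for term in extract(text)})
    return {term for term, count in votes.items() if count >= x}


# Hypothetical stand-ins for real extractors; each returns a fixed list.
def extractor_a(text):
    return ["Hansard", "Wikipedia", "debate"]


def extractor_b(text):
    return ["hansard", "wikipedia"]


def extractor_c(text):
    return ["Hansard", "random"]


if __name__ == "__main__":
    extractors = [extractor_a, extractor_b, extractor_c]
    # x=2: keep only terms that at least two extractors agree on.
    print(sorted(consensus_terms("some copy", extractors, x=2)))
```

Raising x trades false positives for false negatives: with x=3 only
"hansard" survives here, and with x=1 you get every term any extractor
suggested.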

- The BBC has at least one *excellent* term extractor in-house which
adds extra metadata like 'this term is a person/place/topic'... would
be a lovely API to offer, hint hint...
-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/[email protected]/
