all-- been sitting on this "term extraction" topic (no pun...) for over a month now, and i've got a more extensive treatise brewing, but not finished...
so, meanwhile, a couple of things to mention in this area...

1) tom loosemore: "So, why not throw the copy through several more term extractors then only use the overlapping terms?" Rhys: "Though I'm uneasy about a possible situation where one of your term extractors comes up with a great set of terms, but the others miss them completely, and so your output is a bad compromise of terms that aren't that meaningful." i've personally explored this approach fairly thoroughly over the past few years, at work and at, um, play, and feel it's really effective -- in practice, i haven't come across a situation where "your output is a bad compromise of terms that aren't that meaningful" -- tho i suppose that depends on the particular use cases you apply it to... i'll post a little code/prototype app that illustrates this approach for people to poke at soon...

2) here's something i've been exploring and would like to suggest others try, to see if you agree it's promising: download a wikipedia dump... index it into Lucene, one Lucene doc per wikipedia page/concept/URI... compare your own (text) content to that Wikipedia-in-Lucene collection, using Lucene's MoreLikeThis... MoreLikeThis suggests wikipedia articles "similar" to your content... let the "term extraction-like, but with unique, semantic-web-ready ID/URI hijinks" begin... again, i should have some (nasty) code/prototype web app available for comment/debunking soon...

3) "The BBC has at least one *excellent* term extractor in house which adds extra metadata like 'this term is a person/place/topic'... would be a lovely API to offer, hint hint... Ah - has this been used to derive the subject categories and contributors for the web version of Infax, by any chance? If so, and even if not, that would be a gorgeous API to offer - please, BBC..." agree that the Beeb should try to make this into a public-facing API!

4) i agree that http://sws.clearforest.com/ws is really good and useful...
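for anyone who wants to poke at the voting idea in point 1 before my prototype app shows up: here's a minimal pure-Python sketch. the three extractor outputs at the bottom are made-up placeholders (any real APIs -- Yahoo, Trynt, ClearForest -- would slot in there); the only real logic is "count votes per term, keep terms with at least x votes":

```python
from collections import Counter

def combine_terms(extractor_outputs, min_votes=2):
    """Keep only terms proposed by at least `min_votes` extractors.

    `extractor_outputs` is a list of term lists, one per extractor.
    Terms are lowercased so differing casing conventions between
    extractors don't prevent a match.
    """
    votes = Counter()
    for terms in extractor_outputs:
        # one set per extractor, so a single extractor can't vote twice
        votes.update({t.lower() for t in terms})
    return sorted(t for t, n in votes.items() if n >= min_votes)

# hypothetical outputs from three different term-extraction services
yahoo = ["BBC", "term extraction", "semantic web"]
trynt = ["bbc", "Term Extraction", "Infax"]
clearforest = ["BBC", "semantic web", "Calais"]

print(combine_terms([yahoo, trynt, clearforest], min_votes=2))
# -> ['bbc', 'semantic web', 'term extraction']
```

raising `min_votes` is exactly Tom's x knob: higher values trade false positives for false negatives.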
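and a toy sketch of what the Wikipedia-in-Lucene idea in point 2 does under the hood: MoreLikeThis roughly picks the source document's highest-scoring (tf-idf-ish) terms and queries the index with them. everything below is made up for illustration -- three stand-in "wikipedia" docs instead of a real dump, plain python instead of Lucene -- but note the article URIs doubling as the semantic-web-ready IDs:

```python
import math
from collections import Counter

# toy stand-ins for "one Lucene doc per wikipedia page/concept/URI"
wiki = {
    "http://en.wikipedia.org/wiki/Term_extraction":
        "term extraction pulls key terms and phrases out of text",
    "http://en.wikipedia.org/wiki/Semantic_Web":
        "the semantic web links data with unique ids and uris",
    "http://en.wikipedia.org/wiki/Cricket":
        "cricket is a bat and ball game played between two teams",
}

def tokens(text):
    return text.lower().split()

def top_terms(doc, corpus, k=5):
    """Pick the doc's k highest tf-idf terms -- roughly what
    MoreLikeThis does before turning them into a query."""
    n = len(corpus)
    df = Counter()
    for body in corpus.values():
        df.update(set(tokens(body)))
    tf = Counter(tokens(doc))
    scored = {t: f * math.log(1 + n / (1 + df[t])) for t, f in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def more_like_this(doc, corpus, k=5):
    """Rank corpus docs by how many of the doc's top terms they contain."""
    query = top_terms(doc, corpus, k)
    scores = {
        uri: sum(t in tokens(body) for t in query)
        for uri, body in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

content = "extracting terms from text for the semantic web"
print(more_like_this(content, wiki)[0])
# -> http://en.wikipedia.org/wiki/Term_extraction
```

the real thing (Lucene's `MoreLikeThis` class) adds stopword filtering, minimum doc-frequency cutoffs, and proper scoring, which matters a lot more at wikipedia scale than in this toy.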
anyone made any progress with GATE/ANNIE, tho? how about LingPipe? what about the new-ish Yahoo! Pipes entity extraction?

5) in this term extraction/semantic web space, this could be REALLY big -- check it out and let us know what you make of it: Calais - Overview. "Calais: Connect. Everything. We want to make all the world's content more accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the semantic web or the Giant Global Graph - we call our piece of it Calais." http://reuters.mashery.com/ insanely useful? thoughts?

best--
--cs

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Rhys Jones
Sent: Tue 11/27/2007 11:09 AM
To: backstage@lists.bbc.co.uk
Subject: Re: [backstage] Muddy Boots on Backstage

On 26/11/2007, Tom Loosemore <[EMAIL PROTECTED]> wrote:
> ...you can minimise "false positive" terms by running the copy
> through several different flavours of term extractor, and only using
> terms thrown up by x or more of them (where x depends on your appetite
> for false positives vs false negatives).
>
> So, why not throw the copy through several more term extractors then
> only use the overlapping terms?

This should work (and it's been suggested on the backstage-dev list recently). Though I'm uneasy about a possible situation where one of your term extractors comes up with a great set of terms, but the others miss them completely, and so your output is a bad compromise of terms that aren't that meaningful.

Do any APIs let you see the confidence score on their output terms? Having admittedly not thought about this much, it seems to me that a confidence score is key to any realistic combination algorithm.

In terms (sorry) of quality of output, people seem to like Yahoo's API. I've come across Trynt's offering too (http://www.trynt.com/trynt-contextual-term-extraction-api/), but ominously their website is giving me a 403 Forbidden error right now.
http://www.programmableweb.com/api/clearforest-semantic-web-services1/ has also been suggested on the "pure technical discussion" list.

> - The BBC has at least one *excellent* term extractor in house which
> adds extra metadata like 'this term is a person/place/topic'... would
> be a lovely API to offer, hint hint...

Ah - has this been used to derive the subject categories and contributors for the web version of Infax, by any chance? If so, and even if not, that would be a gorgeous API to offer - please, BBC...

Rhys

- Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html. Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/