all--

been sitting on this "term extraction" topic (no pun...) for over a month now, 
and i've got a more extensive treatise brewing, but not finished... 

so, meanwhile, a couple of things to mention in this area...


1) tom loosemore: "So, why not throw the copy through several more term 
extractors then
> only use the overlapping terms?"

Rhys: "Though I'm uneasy about a possible situation where one of
your term extractors comes up with a great set of terms, but the
others miss them completely, and so your output is a bad compromise of
terms that aren't that meaningful."


i've personally explored this approach fairly thoroughly over the past few 
years, at work and at, um, play, and find it really effective -- in practice, 
i haven't come across a situation where "your output is a bad compromise of 
terms that aren't that meaningful"... tho i suppose that depends on the 
particular use cases you apply it to. i'll post a little code/prototype app 
that illustrates this approach for people to poke at soon...
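meanwhile, to make the voting idea concrete, here's a minimal pure-python 
sketch of it (the extractor names and their outputs below are made up for 
illustration -- this is the shape of the thing, not my actual prototype):

```python
from collections import Counter

def overlap_terms(extractor_outputs, min_votes=2):
    """Keep only terms proposed by at least `min_votes` extractors.

    `extractor_outputs` is a list of term collections, one per extractor.
    Terms are lowercased before voting so "Semantic Web" and
    "semantic web" count as the same term.
    """
    votes = Counter()
    for terms in extractor_outputs:
        # one set per extractor, so a single extractor can't vote twice
        for term in set(t.lower() for t in terms):
            votes[term] += 1
    return {term for term, n in votes.items() if n >= min_votes}

# hypothetical outputs from three different term extractors
yahoo = {"term extraction", "BBC", "semantic web"}
clearforest = {"Semantic Web", "BBC", "Reuters"}
gate = {"semantic web", "term extraction", "GATE"}

print(sorted(overlap_terms([yahoo, clearforest, gate], min_votes=2)))
# → ['bbc', 'semantic web', 'term extraction']
```

the `min_votes` knob is exactly tom's "x" -- raise it and you trade false 
positives for false negatives.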



2) here's something i've been exploring and would like to suggest others try, 
to see if you agree it's promising: download a wikipedia dump... index it into 
Lucene, one Lucene doc per wikipedia page/concept/URI... compare your own 
(text) content to that Wikipedia-in-Lucene collection using Lucene's 
MoreLikeThis... MoreLikeThis suggests wikipedia articles "similar" to your 
content... let the "term extraction-like, but with unique, semantic-web-ready 
ID/URI hijinks" begin... again, i should have some (nasty) code/prototype web 
app available for comment/debunking soon...
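i won't inflict the Lucene setup on an email, but here's a toy pure-python 
sketch of the core idea: a tiny in-memory stand-in for the wikipedia index, 
ranked against your content by term-frequency cosine similarity, which is 
roughly the flavour of scoring MoreLikeThis does. the article titles and 
texts are invented for illustration:

```python
import math
from collections import Counter

# toy stand-in for the wikipedia-in-lucene index: title -> article text
WIKI = {
    "Term_extraction": "term extraction pulls key terms and phrases from text",
    "Semantic_Web":    "the semantic web links data with unique URIs and RDF",
    "Cricket":         "cricket is a bat and ball game played between two teams",
}

def tf(text):
    """Term-frequency vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two term-frequency vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def more_like_this(content, index=WIKI, top_n=2):
    """Rank 'wikipedia' articles by similarity to `content`.

    Each hit doubles as a stable concept ID/URI -- which is the point:
    term extraction that lands on unique identifiers, not bare strings.
    """
    query = tf(content)
    ranked = sorted(index, key=lambda t: cosine(query, tf(index[t])),
                    reverse=True)
    return ranked[:top_n]

print(more_like_this("extracting terms from text on the semantic web"))
```

the real thing would use Lucene's IDF weighting, stopword handling, and an 
actual wikipedia dump, of course -- this just shows the shape of the mapping 
from free text to wikipedia page titles.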



3) "The BBC has at least one *excellent* term extractor in house which
> adds extra metadata like 'this term is a person/place/topic'... would
> be a lovely API to offer, hint hint... Ah - has this been used to derive the 
> subject categories and
contributors for the web version of Infax, by any chance? If so, and
even if not, that would be a gorgeous API to offer - please, BBC..."

agree that the Beeb should try to make this into a public-facing API! 



4) i agree that http://sws.clearforest.com/ws is really good and useful... 
anyone made any progress with GATE/ANNIE tho? how about LingPipe? what about 
the new-ish Yahoo! Pipes entity extraction?



5) in this term extraction/semantic web space, this could be REALLY big -- 
check it out and let us know what you make of it:

Calais - Overview

"Calais: Connect. Everything. We want to make all the world's content more 
accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the 
semantic web or the Giant Global Graph - we call our piece of it Calais."

http://reuters.mashery.com/

insanely useful? thoughts?




best--

--cs





-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Rhys Jones
Sent: Tue 11/27/2007 11:09 AM
To: backstage@lists.bbc.co.uk
Subject: Re: [backstage] Muddy Boots on Backstage
 
On 26/11/2007, Tom Loosemore <[EMAIL PROTECTED]> wrote:

> ...you can minimise "false positive" terms by running the copy
> through several different flavours of term extractor, and only using
> terms thrown up by x or more of them (where x depends on your appetite
> for false positives vs false negatives).
>
> So, why not throw the copy through several more term extractors then
> only use the overlapping terms?

This should work (and it's been suggested on the backstage-dev list
recently). Though I'm uneasy about a possible situation where one of
your term extractors comes up with a great set of terms, but the
others miss them completely, and so your output is a bad compromise of
terms that aren't that meaningful.

Do any APIs let you see the confidence score on their output terms?
Having admittedly not thought about this much, it seems to me that a
confidence score is key to any realistic combination algorithm.

In terms (sorry) of quality of output, people seem to like Yahoo's
API. I've come across Trynt's offering too
(http://www.trynt.com/trynt-contextual-term-extraction-api/ ), but
ominously their website is giving me a 403 Forbidden error right now.
http://www.programmableweb.com/api/clearforest-semantic-web-services1/
has also been suggested on the "pure technical discussion" list.

> - The BBC has at least one *excellent* term extractor in house which
> adds extra metadata like 'this term is a person/place/topic'... would
> be a lovely API to offer, hint hint...

Ah - has this been used to derive the subject categories and
contributors for the web version of Infax, by any chance? If so, and
even if not, that would be a gorgeous API to offer - please, BBC...

Rhys
-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
