Re: [CODE4LIB] wikipedia/author disambiguation

Jonathan Rochkind Thu, 19 May 2011 07:42:17 -0700

In addition to the approaches you note, it might be worth investigating this tool that came up in a thread just a few days ago on this list:


http://wikipedia-miner.sourceforge.net/

I don't think anybody's done enough with this yet to be sure what will work best. I think you're going to have to experiment and let us know.

VIAF/OCLC services are presumably using some sort of statistical analysis/text mining approaches under the hood; wikipedia-miner uses such approaches too, but also gives you the code as open source if you're curious exactly what it's doing. I suspect statistical approaches like the ones wikipedia-miner uses are likely to be more productive than pure "parsing" approaches that consider only one record at a time in isolation. But writing your own statistical analysis algorithms is probably more work than you want, especially when wikipedia-miner and/or VIAF/OCLC services already exist.
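For comparison, here is roughly what I mean by a one-record-at-a-time approach: a minimal Python sketch that scores each candidate article against a single catalogue record using name similarity plus matching birth and death years. The field names, the weights (0.6/0.3/0.1) and the 0.7 threshold are all assumptions made up for illustration, not anything tuned or taken from wikipedia-miner or VIAF.

```python
from difflib import SequenceMatcher


def score_candidate(record, candidate):
    """Score one candidate against one catalogue record, roughly 0.0 to 1.0.

    Both arguments are plain dicts with hypothetical keys:
    "name", "birth_year", "death_year".
    """
    name_similarity = SequenceMatcher(
        None, record["name"].lower(), candidate["name"].lower()
    ).ratio()
    score = 0.6 * name_similarity  # assumed weight
    # Dates only count when the record has them and they match exactly.
    if record.get("birth_year") and record["birth_year"] == candidate.get("birth_year"):
        score += 0.3  # assumed weight
    if record.get("death_year") and record["death_year"] == candidate.get("death_year"):
        score += 0.1  # assumed weight
    return score


def best_match(record, candidates, threshold=0.7):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    scored = [(score_candidate(record, c), c) for c in candidates]
    if not scored:
        return None
    top_score, top_candidate = max(scored, key=lambda pair: pair[0])
    return top_candidate if top_score >= threshold else None


if __name__ == "__main__":
    record = {"name": "George Eliot", "birth_year": 1819, "death_year": 1880}
    candidates = [
        {"name": "George Eliot", "birth_year": 1819, "death_year": 1880},
        {"name": "George Elliott", "birth_year": 1932, "death_year": None},
    ]
    print(best_match(record, candidates))
```

Note that this looks at nothing beyond the record itself, which is exactly the limitation: it cannot learn from the rest of the corpus which names are common or which spellings co-occur.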

If you don't do statistical analysis of the corpus, and do end up actually trying to search wikipedia directly -- then I suspect dbpedia is a lot more convenient endpoint than trying to screen-scrape HTML wikipedia. That's pretty much what dbpedia is for.
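As a rough idea of what a dbpedia lookup might look like, here is a minimal Python sketch that builds a SPARQL query for a person by exact English label, optionally filtered by birth year, and turns it into a request URL. The endpoint URL, the `format` parameter, and the choice of `dbo:Person`, `rdfs:label` and `dbo:birthDate` are my assumptions about how dbpedia is set up; check them against the live service. The function names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed public endpoint; verify before relying on it.
DBPEDIA_SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def build_author_query(name, birth_year=None, limit=5):
    """Build a SPARQL query string for people whose English label equals `name`."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT ?person ?birth WHERE {",
        "  ?person a dbo:Person ;",
        f'          rdfs:label "{escaped}"@en .',
        "  OPTIONAL { ?person dbo:birthDate ?birth }",
    ]
    if birth_year is not None:
        # Keep people with no recorded birth date, or one starting with the given year.
        year_prefix = f"{int(birth_year):04d}-"
        lines.append(
            f'  FILTER (!BOUND(?birth) || STRSTARTS(STR(?birth), "{year_prefix}"))'
        )
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def build_query_url(query, endpoint=DBPEDIA_SPARQL_ENDPOINT):
    """Return a GET URL for the query, asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_author_query("George Eliot", birth_year=1819)
    print(query)
    print(build_query_url(query))
```

An exact label match will miss variant spellings and inverted catalogue forms ("Eliot, George"), so in practice you would probably need to normalise the name first or query on several forms.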


But these are all just my guesses, not informed by any work I've done.

Jonathan


On 5/19/2011 5:40 AM, graham wrote:

I need to be able to take author data from a catalogue record and use it
to look up the author on Wikipedia on the fly. So I may have birth date
and possibly year of death in addition to (one spelling of) the name,
the title of one book the author wrote etc.

I know there are various efforts in progress that will improve the
current situation, but as things stand at the moment what is the best*
way to do this?

1. query wikipedia for as much as possible, parse and select the best
fitting result

2. go via dbpedia/freebase and work back from there

3. use VIAF and/or OCLC services

4. Other?

(I have no experience of 2-4 yet :-(


Thanks
Graham
* 'best' being constrained by:
- need to do this in real-time
- need to avoid dependence on services which may be taken away
or charged for
- being able to justify to librarians as reasonably accurate :-)
