In addition to the approaches you note, it might be worth investigating
this tool, which came up in a thread on this list just a few days ago:
http://wikipedia-miner.sourceforge.net/
I don't think anybody has done enough with this yet to be sure what will
work best, so you're probably going to have to experiment and let us know.
The VIAF/OCLC services are presumably using some sort of statistical
analysis/text mining approach under the hood; wikipedia-miner uses
such approaches too, but it also gives you the code as open source if
you're curious exactly what it's doing. I suspect statistical approaches
like the ones wikipedia-miner uses are likely to be more productive than
pure "parsing" approaches that consider only one record at a time in
isolation. But writing your own statistical analysis algorithms is
probably more work than you want, especially when wikipedia-miner and/or
the VIAF/OCLC services already exist.
If you don't do statistical analysis of the corpus, and do end up
actually trying to search Wikipedia directly -- then I suspect dbpedia
is a much more convenient endpoint than trying to screen-scrape
Wikipedia's HTML. That's pretty much what dbpedia is for.
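
To make the dbpedia route a bit more concrete, here is a minimal sketch in
Python of how a lookup by name and birth year might be phrased as a SPARQL
query. The endpoint address (https://dbpedia.org/sparql), the properties
(foaf:name, dbo:birthDate, foaf:isPrimaryTopicOf), the @en language tag, the
"format" request parameter and the helper names are all assumptions on my
part about how dbpedia models people, so check them before relying on this.
It only builds the request URL; it does not send anything.

```python
from urllib.parse import urlencode

# Assumed public endpoint; confirm the current address on the dbpedia site.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_author_query(name, birth_year=None, limit=5):
    """Build a SPARQL query for people with exactly this name (hypothetical helper).

    If birth_year is given, candidates with a different birth year are
    dropped; candidates with no recorded birth date are kept.
    """
    lines = [
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
        "SELECT DISTINCT ?person ?page ?birth WHERE {",
        "  ?person a dbo:Person ;",
        '          foaf:name "%s"@en ;' % escape_literal(name),
        "          foaf:isPrimaryTopicOf ?page .",
        "  OPTIONAL { ?person dbo:birthDate ?birth }",
    ]
    if birth_year is not None:
        # Compare on the string form so xsd:date and xsd:dateTime both work.
        lines.append(
            '  FILTER (!BOUND(?birth) || STRSTARTS(STR(?birth), "%04d"))'
            % int(birth_year)
        )
    lines.append("}")
    lines.append("LIMIT %d" % int(limit))
    return "\n".join(lines)


def build_request_url(query):
    """Return a GET URL asking the endpoint for JSON results (assumed parameter names)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)
```

Something like build_request_url(build_author_query("George Orwell", 1903))
should give a URL you can fetch with any HTTP client; if the assumptions
hold, the ?page values that come back would be the Wikipedia article links.
An exact match on one spelling of the name is the weak point here, so you
would likely still need some fallback for variant spellings.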
But these are all just my guesses, not informed by any work I've done.
Jonathan
On 5/19/2011 5:40 AM, graham wrote:
I need to be able to take author data from a catalogue record and use it
to look up the author on Wikipedia on the fly. So I may have birth date
and possibly year of death in addition to (one spelling of) the name,
the title of one book the author wrote etc.
I know there are various efforts in progress that will improve the
current situation, but as things stand at the moment what is the best*
way to do this?
1. query wikipedia for as much as possible, parse and select the best
fitting result
2. go via dbpedia/freebase and work back from there
3. use VIAF and/or OCLC services
4. Other?
(I have no experience of 2-4 yet :-(
Thanks
Graham
* 'best' being constrained by:
- need to do this in real-time
- need to avoid dependence on services which may be taken away
or charged for
- being able to justify to librarians as reasonably accurate :-)