Is anyone here using Nutch for crawling digital scholarly archives?
If so, are you also harvesting and indexing additional metadata?
My group (http://www.patacriticism.org) is considering using Nutch to
crawl a specific set of sites, index the HTML as full text, and also
retrieve any associated RDF data (located via a hyperlink in a <meta>
tag, perhaps, as on this page:
http://www.rossettiarchive.org/docs/1-1847.s244.raw.html).
Most likely the RDF could simply be indexed as additional fields, but
it might also be loaded into an RDF engine (such as Kowari) and queried
from the search interface in conjunction with full-text searching.
The Ontology and Creative Commons plugins are great starting places,
for sure. I'm wondering what others have done along these lines.
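To make the question concrete, here is a minimal Python sketch of the
discovery step described above: scan a fetched page for a <meta> tag
that points at an RDF document and resolve that link against the page
URL. This is not Nutch code (in Nutch this would presumably be done
inside a parse plugin), and the meta name "rdf", the names
RdfLinkFinder and find_rdf_links, and the example.org URLs are all
assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collects URLs of RDF documents advertised in <meta> tags."""

    def __init__(self, base_url, meta_name="rdf"):
        super().__init__()
        self.base_url = base_url
        # Hypothetical convention: <meta name="rdf" content="URL">
        self.meta_name = meta_name
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content")
        if name == self.meta_name and content:
            # Resolve relative links against the URL of the crawled page.
            self.rdf_urls.append(urljoin(self.base_url, content))


def find_rdf_links(html, base_url, meta_name="rdf"):
    """Return absolute URLs of RDF documents referenced by the page."""
    finder = RdfLinkFinder(base_url, meta_name)
    finder.feed(html)
    finder.close()
    return finder.rdf_urls


if __name__ == "__main__":
    page = '<html><head><meta name="rdf" content="item.rdf"></head></html>'
    print(find_rdf_links(page, "http://example.org/docs/item.html"))
```

The returned URLs could then be fetched and their statements either
flattened into extra index fields or handed to an RDF store.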
Thanks,
Erik