Dear list,
I would like to process metadata from publication repositories into a
Nutch index.
The metadata comes as XML (OAI-PMH, to be more precise).
The starting URLs look like
http://oai_host/servlet?method=getRecords&set=someSet
These requests return lists,
which basically look like
<list>
<item>
<id>32423</id>
<content>very long description1, e.g. an abstract</content>
<url>http://somewhere.com/somedoc1.pdf</url>
</item>
<item>
<id>12441</id>
<content>very long description2, e.g. an abstract</content>
<url>http://somewhereelse.it/somedoc2.pdf</url>
</item>
</list>
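
Just to make the structure concrete, I imagine reading such a list
with plain JAXP roughly like this (only a sketch against the sample
above; the OaiItem/OaiListReader names are made up, not from any
existing plugin):

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// One <item> from the list (made-up holder class).
class OaiItem {
    String id, content, url;
}

class OaiListReader {
    // Parse the <list> document and collect id/content/url per <item>.
    static List<OaiItem> read(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        NodeList items = doc.getElementsByTagName("item");
        List<OaiItem> result = new ArrayList<OaiItem>();
        for (int i = 0; i < items.getLength(); i++) {
            Element e = (Element) items.item(i);
            OaiItem item = new OaiItem();
            item.id = text(e, "id");
            item.content = text(e, "content");
            item.url = text(e, "url");
            result.add(item);
        }
        return result;
    }

    // First matching child element's text, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList n = parent.getElementsByTagName(tag);
        return n.getLength() > 0 ? n.item(0).getTextContent() : null;
    }
}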
My initial idea was to use the Parser extension point
and provide a plugin that works the same way the rss-parser does:
return all outlinks to the detail views
- e.g.
http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_ -
and skip the content of the list itself.
Following these links would return documents containing one item each.
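
In such a plugin, the outlink generation would then boil down to
something like the following (again a sketch reusing the made-up
OaiListReader from above; in a real Parser implementation these
strings would be wrapped into Nutch Outlink objects and returned via
the ParseData, much as the rss-parser does with its entries):

// One "detail view" outlink per <item>.
List<String> outlinks = new ArrayList<String>();
for (OaiItem item : OaiListReader.read(in)) {
    outlinks.add("http://oai_host/servlet?method=getSingleRecord&id="
            + item.id);
}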
Is it possible to store these documents under the URL from the
<url> element instead of the "real" URL (i.e. the servlet URI used for
the request)?
Would this work out? Can you suggest a better approach?
Anyway, refetching every single hit is pretty much a waste,
as all the information is already included in the list.
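If skipping the refetch is possible, the parse step could emit one
record per item straight from the list, keyed by the <url> value
rather than the servlet URI, roughly like this (same made-up classes
as above):

// One document per <item>, keyed by the URL from the list itself,
// so no second getSingleRecord fetch would be needed.
Map<String, String> docs = new LinkedHashMap<String, String>();
for (OaiItem item : OaiListReader.read(in)) {
    docs.put(item.url, item.content);
}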
Any comments on that?
Help would be very much appreciated,
Best regards
Karsten