Dear list,

I would like to process metadata from publication repositories into a 
Nutch index.
The metadata comes as XML (OAI-PMH, to be more precise).

The starting URLs look like

http://oai_host/servlet?method=getRecords&set=someSet

These requests return lists,
which basically look like

<list>
<item>
        <id>32423</id>
        <content>very long description1, e.g. an abstract</content>
        <url>http://somewhere.com/somedoc1.pdf</url>
</item>

<item>
        <id>12441</id>
        <content>very long description2, e.g. an abstract</content>
        <url>http://somewhereelse.it/somedoc2.pdf</url>
</item>

</list>

My initial idea was to utilize the parser extension point
and provide a plugin that works the same way the RSS parser does:
return all outlinks to the detailed-view forms
- e.g. 
http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_ -
and skip the content of the list.
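
To make that concrete, here is a rough sketch of the extraction step
in plain Java (DOM), assuming the list format shown above. The element
names and the servlet URL pattern are just taken from my example; the
actual wiring into the parser extension point (returning these links
as outlinks, the way the RSS parser does) is omitted here.

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OaiListOutlinks {

    // Build the detailed-view URLs a parser plugin would return as
    // outlinks, one per <item> in the list.
    public static List<String> extractOutlinks(byte[] xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));

        List<String> outlinks = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String id = item.getElementsByTagName("id")
                            .item(0).getTextContent().trim();
            outlinks.add(
                "http://oai_host/servlet?method=getSingleRecord&id=" + id);
        }
        return outlinks;
    }
}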

Following these links would return documents containing one item only.
Is it possible to store these documents under the URL from the 
<url> element instead of the "real" URL (i.e. the servlet URI used for 
the request)?

Would this work out? Can you suggest a better approach?

Anyway, refetching all the single hits is pretty much a waste,
as all the information is already included in the list
(see the sketch below).
Any comments on that?
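
To illustrate the point: all fields needed for indexing can be pulled
from the list in a single pass, along these lines (same assumptions
about the list format as above):

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OaiListRecords {

    // One record per <item>; everything a detailed view would give us
    // is already here.
    public static class Record {
        public String id, content, url;
    }

    public static List<Record> parseList(byte[] xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));

        List<Record> records = new ArrayList<Record>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            Record r = new Record();
            r.id = text(item, "id");
            r.content = text(item, "content");
            r.url = text(item, "url");
            records.add(r);
        }
        return records;
    }

    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag)
                     .item(0).getTextContent().trim();
    }
}

Each <item> could then become its own document, keyed by its <url>
element, without a second round of fetches.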


Help would be very much appreciated,

Best regards

Karsten




