Dear list,

I would like to process metadata from publication repositories into a Nutch index. The metadata comes as XML (OAI-PMH, to be more precise).
The starting URLs look like

  http://oai_host/servlet?method=getRecords&set=someSet

These requests return lists, which basically look like this:

  <list>
    <item>
      <id>32423</id>
      <content>very long description 1, e.g. an abstract</content>
      <url>http://somewhere.com/somedoc1.pdf</url>
    </item>
    <item>
      <id>12441</id>
      <content>very long description 2, e.g. an abstract</content>
      <url>http://somewhereelse.it/somedoc2.pdf</url>
    </item>
  </list>

My initial idea was to use the Parser extension point and provide a plugin that works the same way the rss-parser does: return all outlinks to the detail-view forms, e.g.

  http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_

and skip the content of the list itself. Following these links would return documents containing one item each.

Is it possible to store these documents under the URL from the <url> element instead of the "real" URL (i.e. the servlet URI used for the request)? Would this work out? Can you suggest a better approach?

In any case, refetching every single hit is pretty much a waste, as all the information is already included in the list. Any comments on that?

Help would be very much appreciated.

Best regards,
Karsten
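P.S. To make the idea concrete, here is a rough, self-contained sketch of the extraction step such a plugin would have to do. It uses plain DOM and leaves out the actual Nutch Parser plumbing; the element names, the servlet URL pattern, and the class name are just placeholders taken from my example above, not real OAI-PMH or Nutch API:

  import java.io.StringReader;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;
  import org.xml.sax.InputSource;

  public class OaiListSketch {

      // Simplified list structure from the example above.
      static final String SAMPLE =
          "<list>"
        + " <item><id>32423</id><content>abstract 1</content>"
        + "  <url>http://somewhere.com/somedoc1.pdf</url></item>"
        + " <item><id>12441</id><content>abstract 2</content>"
        + "  <url>http://somewhereelse.it/somedoc2.pdf</url></item>"
        + "</list>";

      public static void main(String[] args) throws Exception {
          Document doc = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new InputSource(new StringReader(SAMPLE)));

          NodeList items = doc.getElementsByTagName("item");

          // Idea 1: emit one outlink per item to the detail view,
          // the way the rss-parser emits outlinks for feed entries.
          for (int i = 0; i < items.getLength(); i++) {
              Element item = (Element) items.item(i);
              System.out.println("outlink: http://oai_host/servlet"
                  + "?method=getSingleRecord&id=" + text(item, "id"));
          }

          // Idea 2: everything is already in the list, so the
          // (url, content) pairs could be extracted directly, with
          // no second fetch -- if each item could be stored under
          // its own <url> value instead of the servlet URI.
          for (int i = 0; i < items.getLength(); i++) {
              Element item = (Element) items.item(i);
              System.out.println(text(item, "url") + " -> "
                  + text(item, "content"));
          }
      }

      static String text(Element parent, String tag) {
          return parent.getElementsByTagName(tag).item(0).getTextContent();
      }
  }

The second loop is exactly the point of my last question: after one fetch of the list, the (url, content) pairs are already complete, so in principle nothing would need to be refetched at all.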
