Hi Karsten,

Nutch has a built-in limitation: one URL corresponds to exactly one document (in the crawlDB or the index). The content and metadata for that document are normally whatever is available 'behind' the URL. The only exception is anchor text: anchor text is data from the "mother" (linking) URL that is passed on and indexed with the "child" document. So you could hack something together and try to use the anchor text as a data container, but I am not sure that will solve your problem.
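To make the anchor-text idea a bit more concrete, here is a minimal standalone sketch in Python. It is not Nutch plugin code: the `Outlink` tuple is a hypothetical stand-in for a (target URL, anchor text) pair, the function name is made up for illustration, and the sample data simply follows the list format quoted in the question.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

# Sample list response, following the format shown in the original question.
SAMPLE = """<list>
  <item>
    <id>32423</id>
    <content>very long description1, e.g. an abstract</content>
    <url>http://somewhere.com/somedoc1.pdf</url>
  </item>
  <item>
    <id>12441</id>
    <content>very long description2, e.g. an abstract</content>
    <url>http://somewhereelse.it/somedoc2.pdf</url>
  </item>
</list>"""


class Outlink(NamedTuple):
    # Hypothetical stand-in for a (target URL, anchor text) pair.
    to_url: str
    anchor: str


def outlinks_with_anchor_data(list_xml: str) -> List[Outlink]:
    """Turn each <item> into an outlink whose anchor text carries the abstract."""
    root = ET.fromstring(list_xml)
    links = []
    for item in root.iter("item"):
        url = (item.findtext("url") or "").strip()
        content = (item.findtext("content") or "").strip()
        if url:  # items without a target URL cannot become outlinks
            links.append(Outlink(to_url=url, anchor=content))
    return links


if __name__ == "__main__":
    for link in outlinks_with_anchor_data(SAMPLE):
        print(link.to_url, "<-", link.anchor)
```

In this sketch the "child" document is the PDF named in the url element, and the abstract travels with the link as anchor text. Whether a very long abstract is usable as anchor text in practice is an open question.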
I suggest extracting the links from your starting URLs and fetching the content from the detail pages. I am not sure whether that will help you.

Stefan

On 08.06.2006 at 21:42, Karsten Dello wrote:

> Dear list,
>
> I would like to process metadata from publication repositories into
> a nutch index.
> The metadata comes as XML (OAI-PMH, to be more precise).
>
> The starting URLs look like
>
> http://oai_host/servlet?method=getRecords&set=someSet
>
> These requests return lists,
> which basically look like
>
> <list>
> <item>
> <id>32423</id>
> <content>very long description1, e.g. an abstract</content>
> <url>http://somewhere.com/somedoc1.pdf</url>
> </item>
>
> <item>
> <id>12441</id>
> <content>very long description2, e.g. an abstract</content>
> <url>http://somewhereelse.it/somedoc2.pdf</url>
> </item>
>
> </list>
>
> My initial idea was to utilize the Parser extension point
> and provide a plugin which works the same way the rss-parser does:
> return all outlinks to the detailed view forms
> - e.g. http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_ -
> and skip the content of the list.
>
> Following these links would return documents with one item only.
> Is it possible to store these documents with the URL from the <url>
> element instead of the "real" URL (i.e. the servlet URI used for
> the request)?
>
> Would this work out? Can you suggest a better approach?
>
> Anyway, refetching all single hits is pretty much a waste,
> as all information is already included in the list.
> Any comments on that?
>
>
> Help would be very much appreciated,
>
> Best regards
>
> Karsten

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
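The suggestion in the reply (extract the links from the starting URLs, then fetch the detail pages) can be sketched roughly as follows. This is a standalone Python illustration under stated assumptions, not Nutch plugin code: the endpoint and the `getSingleRecord` query come from the example URL in the question, `oai_host` is the placeholder host used there, and the function name `detail_urls` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Endpoint from the example in the question; "oai_host" is a placeholder host.
DETAIL_ENDPOINT = "http://oai_host/servlet"


def detail_urls(list_xml: str, endpoint: str = DETAIL_ENDPOINT) -> List[str]:
    """Build one detail-page URL per <item>, keyed by the item's <id>."""
    root = ET.fromstring(list_xml)
    urls = []
    for item in root.iter("item"):
        record_id = (item.findtext("id") or "").strip()
        if record_id:  # items without an id have no detail page to link to
            query = urlencode({"method": "getSingleRecord", "id": record_id})
            urls.append(f"{endpoint}?{query}")
    return urls


if __name__ == "__main__":
    listing = "<list><item><id>32423</id></item><item><id>12441</id></item></list>"
    for url in detail_urls(listing):
        print(url)
```

Each generated URL would then be fetched and indexed as its own document, which fits the one-URL-one-document model at the cost of the extra fetches mentioned in the question.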
