Re: [Nutch-general] parsing and using xml-data

Stefan Groschupf Thu, 08 Jun 2006 16:12:54 -0700

Hi Karsten,

Nutch has the limitation of one URL per document (in the crawlDB or the index).
The content and metadata for a document are normally whatever is available
'behind' its URL. The only exception is anchor text: anchor text is
data from the "mother" URL that is passed to and indexed with the
"child" document.
So you could hack something together and try to use the anchor text as a data
container, but I am not sure that will solve your problem.
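
To make the anchor-text idea a bit more concrete, here is a minimal sketch in
plain Python (not Nutch code). It assumes the list format from your mail;
`Outlink` and `outlinks_from_list` are hypothetical names made up for this
illustration, and a real parser plugin would have to build Nutch's own outlink
objects instead. Each item becomes one outlink whose target is the <url>
element and whose anchor text carries the <content> element.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

# Sample input, modelled on the list format quoted in this thread.
SAMPLE_LIST = """\
<list>
  <item>
    <id>32423</id>
    <content>very long description1, e.g. an abstract</content>
    <url>http://somewhere.com/somedoc1.pdf</url>
  </item>
  <item>
    <id>12441</id>
    <content>very long description2, e.g. an abstract</content>
    <url>http://somewhereelse.it/somedoc2.pdf</url>
  </item>
</list>
"""


class Outlink(NamedTuple):
    """Hypothetical stand-in for an outlink: target URL plus anchor text."""
    to_url: str
    anchor: str


def outlinks_from_list(xml_text: str) -> List[Outlink]:
    """Turn every <item> into an outlink that carries <content> as its anchor."""
    root = ET.fromstring(xml_text)
    links = []
    for item in root.iter("item"):
        url = (item.findtext("url") or "").strip()
        content = (item.findtext("content") or "").strip()
        if url:  # skip items without a target URL
            links.append(Outlink(url, content))
    return links


if __name__ == "__main__":
    for link in outlinks_from_list(SAMPLE_LIST):
        print(link.to_url, "<-", link.anchor)
```

Whether a long abstract survives as anchor text in your setup is something you
would have to check; this only shows the shape of the data.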


I suggest extracting the links from your starting URLs and trying to get
the content from the detail pages.
I am not sure whether that will help you.
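
Roughly, the link extraction could look like the sketch below. It is plain
Python rather than a Nutch plugin, and it assumes the list format and the
getSingleRecord URL pattern from your mail; `detail_urls` and `BASE` are
hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Assumed servlet address, taken from the example URLs in this thread.
BASE = "http://oai_host/servlet"


def detail_urls(xml_text: str, base: str = BASE) -> List[str]:
    """Build one detail-page URL per <item>, using the value of its <id>."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.iter("item"):
        record_id = (item.findtext("id") or "").strip()
        if record_id:  # ignore items that have no id
            query = urlencode({"method": "getSingleRecord", "id": record_id})
            urls.append(base + "?" + query)
    return urls


if __name__ == "__main__":
    sample = "<list><item><id>32423</id></item><item><id>12441</id></item></list>"
    for url in detail_urls(sample):
        print(url)
```

Each of these URLs would then be fetched and parsed as its own document, which
fits the one URL, one document model.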
Stefan


On 08.06.2006 at 21:42, Karsten Dello wrote:

> Dear list,
>
> I would like to process metadata from publication repositories into  
> a nutch index.
> The metadata comes as XML (OAI-PMH, to be more precise).
>
> The starting URLs look like
>
> http://oai_host/servlet?method=getRecords&set=someSet
>
> These requests return lists,
> which basically look like
>
> <list>
> <item>
>       <id>32423</id>
>       <content>very long description1, e.g. an abstract</content>
>       <url>http://somewhere.com/somedoc1.pdf</url>
> </item>
>
> <item>
>       <id>12441</id>
>       <content>very long description2, e.g. an abstract</content>
>       <url>http://somewhereelse.it/somedoc2.pdf</url>
> </item>
>
> </list>
>
> My initial idea was to utilize the Parser-Extension-Point
> and provide a plugin which works the same way the rss-parser does:
> return all outlinks to the detailed view forms
> - e.g. http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_ -
> and skip the content of the list.
>
> Following these links would return documents with one item only.
> Is it possible to store these documents with the URL from the <url>
> element instead of the "real" URL (i.e. the servlet URI used for
> the request)?
>
> Would this work out? Can you suggest a better approach?
>
> Anyway, refetching all single hits is pretty much a waste,
> as all information is already included in the list.
> Any comments on that?
>
>
> Help would be very much appreciated,
>
> Best regards
>
> Karsten
>
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
