Re: [Nutch-general] The Nutch Crawler and the Web Link Graph

Andrzej Bialecki Tue, 15 Aug 2006 07:02:17 -0700

John Casey wrote:
> Hi All, is there any way to extract the outlinks of a particular
> webpage/URL? I have had a look at the LinkDBReader but this will only
> give me a listing of pages that link to the page in question. Any
> ideas? I have been having a


Nutch doesn't store this information in a single central place; rather, 
it's spread across the individual segments.

> look in the segments directory and have been trying to read/parse the
> files using Hadoop's SequenceFile.Reader but haven't had much luck
> getting the format right. Is there any documentation on this? My
> intuition tells me that nutch probably does store the outlinks of a URL
> somewhere but it's hard to tell where.

Your intuition is correct - inside a fetched and parsed segment there is 
a map file called parse_data, which contains tuples <url, Parse>. Inside 
each Parse.parseData you can find a list of outlinks.

(Actually, if you ran your fetcher in non-local mode, using several 
reduce tasks, you will end up with several part-xxxxx map files.) You can 
read these using a MapFileReader - run it without args to see the synopsis 
(note: this class is only available here: 
http://issues.apache.org/jira/browse/HADOOP-175).
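
As a rough illustration of the layout described above, here is a minimal 
sketch that lists the part-xxxxx map file directories under a segment's 
parse_data. It uses only the JDK and assumes the segment lives on the 
local filesystem (not on a distributed filesystem) and that each 
part-xxxxx map file is a directory. The class name SegmentParts and the 
method listParseDataParts are hypothetical, made up for this example; 
they are not part of Nutch or Hadoop. It only locates the map files; 
reading the records inside them still needs the Hadoop/Nutch classes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
public class SegmentParts {

    // Returns the part-xxxxx directories under <segment>/parse_data,
    // sorted by name. Assumes a local filesystem path.
    public static List<Path> listParseDataParts(Path segmentDir) throws IOException {
        Path parseData = segmentDir.resolve("parse_data");
        List<Path> parts = new ArrayList<>();
        if (!Files.isDirectory(parseData)) {
            // Segment was not parsed yet, or the path is wrong.
            return parts;
        }
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(parseData, "part-*")) {
            for (Path p : stream) {
                // Each map file is expected to be a directory.
                if (Files.isDirectory(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParts <segment_dir>");
            return;
        }
        for (Path part : listParseDataParts(Paths.get(args[0]))) {
            System.out.println(part);
        }
    }
}
```

With a single-task (local mode) fetch you would expect this to print one 
directory; with several reduce tasks, one per task. Each of them has to 
be read to cover all URLs in the segment.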

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


