Re: The Nutch Crawler and the Web Link Graph

Andrzej Bialecki Tue, 15 Aug 2006 07:01:30 -0700

John Casey wrote:

Hi All, is there any way to extract the outlinks of a particular webpage/URL? I have had a look at the LinkDBReader, but this will only give me a listing of
pages that link to the page in question. Any ideas? I have been having a

Nutch doesn't store this information in a single central place; rather, it's spread across the individual segments.

look in the segments directory and have been trying to read/parse the files
using Hadoop's SequenceFile.Reader but haven't had much luck getting the
format right. Is there any documentation on this? My intuition tells me that
Nutch probably does store the outlinks of a URL somewhere, but it's hard to
tell where.

Your intuition is correct - inside a fetched and parsed segment there is a map file called parse_data, which contains tuples <url, Parse>. Inside each Parse.parseData you can find a list of outlinks.

(Actually, if you ran your fetcher in non-local mode, using several reduce tasks, you will end up with several part-xxxxx map files.) You can read these using a MapFileReader - run it without args to see the synopsis (note: this class is only available here: http://issues.apache.org/jira/browse/HADOOP-175).
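To make the layout above a bit more concrete, here is a minimal sketch in plain Java that models it in memory. It deliberately does not use the Nutch or Hadoop classes: ParseRecord, outlinksOf and sampleSegments are hypothetical names invented for illustration, each TreeMap stands in for one part-xxxxx map file keyed by URL, and the example.com URLs are made up. The assumption is simply that a URL's outlinks have to be collected by probing every part of every segment, since there is no single central store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class OutlinkLookup {

    /** Hypothetical stand-in for one parse_data value; NOT the Nutch class. */
    static final class ParseRecord {
        final List<String> outlinks;

        ParseRecord(String... outlinks) {
            this.outlinks = Arrays.asList(outlinks);
        }
    }

    /**
     * Collects the outlinks recorded for a URL across all segments.
     * Each segment is modelled as a list of parts, and each part as a
     * sorted map from URL to its parse record.
     */
    static Set<String> outlinksOf(String url,
            List<List<SortedMap<String, ParseRecord>>> segments) {
        Set<String> result = new LinkedHashSet<>(); // keeps first-seen order
        for (List<SortedMap<String, ParseRecord>> segment : segments) {
            for (SortedMap<String, ParseRecord> part : segment) {
                ParseRecord rec = part.get(url);
                if (rec != null) {
                    result.addAll(rec.outlinks);
                }
            }
        }
        return result;
    }

    /** Made-up sample data: one segment with two parts, one with a single part. */
    static List<List<SortedMap<String, ParseRecord>>> sampleSegments() {
        SortedMap<String, ParseRecord> part0 = new TreeMap<>();
        part0.put("http://example.com/a",
                new ParseRecord("http://example.com/b", "http://example.com/c"));

        SortedMap<String, ParseRecord> part1 = new TreeMap<>();
        part1.put("http://example.com/b",
                new ParseRecord("http://example.com/a"));

        // A later segment in which the same page was fetched again.
        SortedMap<String, ParseRecord> refetch = new TreeMap<>();
        refetch.put("http://example.com/a",
                new ParseRecord("http://example.com/c", "http://example.com/d"));

        List<List<SortedMap<String, ParseRecord>>> segments = new ArrayList<>();
        segments.add(Arrays.asList(part0, part1));
        segments.add(Arrays.asList(refetch));
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(outlinksOf("http://example.com/a", sampleSegments()));
    }
}
```

With the real files you would replace the in-memory maps with readers over each part-xxxxx directory of parse_data; the loop structure (every segment, every part, look up the URL) would presumably stay the same.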


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
