Re: The Nutch Crawler and the Web Link Graph

John Casey Wed, 16 Aug 2006 18:19:13 -0700

Thanks, it works perfectly, although I did end up merging the segments rather
than using your MapFileReader.


On 8/15/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

John Casey wrote:
> Hi All, is there any way to extract the outlinks of a particular
> webpage/URL? I have had a look at the LinkDBReader but this will only
> give me a listing of pages that link to the page in question. Any
> ideas? I have been having a

Nutch doesn't store this information in a single central place, rather
it's spread inside each segment.

> look in the segments directory and have been trying to read/parse the
> files using Hadoop's SequenceFile.Reader but haven't had much luck
> getting the format right. Is there any documentation on this? My
> intuition tells me that Nutch probably does store the outlinks of a URL
> somewhere but it's hard to tell where.

Your intuition is correct - inside a fetched and parsed segment there is
a map file called parse_data, which contains tuples <url, Parse>. Inside
each Parse.parseData you can find a list of outlinks.

(Actually, if you ran your fetcher in non-local mode, using several
reduce tasks, you will end up with several part-xxxxx map files. You can
read these using a MapFileReader - run it without args to see the synopsis.
Note: this class is only available here:
http://issues.apache.org/jira/browse/HADOOP-175)
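
To make the layout described above a bit more concrete, here is a minimal
sketch in plain Java. It is a stand-in model only and does not use the Nutch
or Hadoop classes: each TreeMap plays the role of one part-xxxxx map file
under parse_data, keyed by URL, with the list of outlinks as the value. The
class name, method names and sample URLs are all hypothetical. It shows the
two options discussed in this thread: probing every part for a URL, or
merging the parts first and doing a single lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Stand-in model, NOT the Nutch or Hadoop API: each TreeMap represents one
 * part-xxxxx map file holding sorted <url, outlinks> entries.
 */
public class OutlinkLookupSketch {

    /** Probe each part in turn; return the first entry found for the URL. */
    public static List<String> outlinksFor(
            List<TreeMap<String, List<String>>> parts, String url) {
        for (TreeMap<String, List<String>> part : parts) {
            List<String> outlinks = part.get(url);
            if (outlinks != null) {
                return outlinks;
            }
        }
        return Collections.emptyList(); // URL not present in any part
    }

    /** Combine all parts into one sorted map so a single lookup suffices. */
    public static TreeMap<String, List<String>> merge(
            List<TreeMap<String, List<String>>> parts) {
        TreeMap<String, List<String>> merged = new TreeMap<>();
        for (TreeMap<String, List<String>> part : parts) {
            for (Map.Entry<String, List<String>> e : part.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    /** Hypothetical sample data: two parts, one page in each. */
    public static List<TreeMap<String, List<String>>> sampleParts() {
        TreeMap<String, List<String>> part0 = new TreeMap<>();
        part0.put("http://example.com/a",
                Arrays.asList("http://example.com/b", "http://example.org/"));
        TreeMap<String, List<String>> part1 = new TreeMap<>();
        part1.put("http://example.com/b",
                Arrays.asList("http://example.com/a"));
        List<TreeMap<String, List<String>>> parts = new ArrayList<>();
        parts.add(part0);
        parts.add(part1);
        return parts;
    }

    public static void main(String[] args) {
        List<TreeMap<String, List<String>>> parts = sampleParts();
        System.out.println(outlinksFor(parts, "http://example.com/a"));
        System.out.println(merge(parts).get("http://example.com/b"));
    }
}
```

With real segments the per-part lookup would go through the map file reader
rather than TreeMap.get, but the shape of the loop should be similar.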

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
