John Casey wrote:
Hi All, is there any way to extract the outlinks of a particular
webpage/URL? I have had a look at the LinkDBReader, but this will only
give me a listing of pages that link to the page in question. Any ideas?

Nutch doesn't store this information in a single central place; rather,
it's spread across the segments.

I have been having a look in the segments directory and have been trying
to read/parse the files using Hadoop's SequenceFile.Reader, but haven't
had much luck getting the format right. Is there any documentation on
this? My intuition tells me that Nutch probably does store the outlinks
of a URL somewhere, but it's hard to tell where.

Your intuition is correct: inside a fetched and parsed segment there is
a map file called parse_data, which contains tuples <url, Parse>. Inside
each Parse.parseData you can find a list of outlinks.
(Actually, if you ran your fetcher in non-local mode, using several
reduce tasks, you will end up with several part-xxxxx map files.) You can
read them using a MapFileReader; run it without args to see the synopsis
(note: this class is only available here:
http://issues.apache.org/jira/browse/HADOOP-175).
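
As a rough illustration of the layout described above, here is a minimal
sketch that lists the part-xxxxx map files under a segment's parse_data
directory. It uses only the JDK and does not read the records themselves.
The class name ParseDataParts and the method listParts are hypothetical,
and it assumes the segment sits on the local filesystem with each map
file stored as a directory named part-xxxxx. Opening each part and
pulling out the outlinks is left to a map file reader such as the one
mentioned above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Nutch or Hadoop.
class ParseDataParts {

    // Returns the part-xxxxx directories under <segment>/parse_data, sorted by name.
    // Assumption: the segment is on the local filesystem.
    static List<Path> listParts(Path segment) throws IOException {
        Path parseData = segment.resolve("parse_data");
        List<Path> parts = new ArrayList<>();
        if (!Files.isDirectory(parseData)) {
            return parts; // segment not parsed yet, or wrong path
        }
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(parseData, "part-*")) {
            for (Path p : stream) {
                if (Files.isDirectory(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java ParseDataParts <segment_dir>");
            return;
        }
        // Each printed path is one map file to hand to a map file reader.
        for (Path part : listParts(Paths.get(args[0]))) {
            System.out.println(part);
        }
    }
}
```

Run it with the path of one segment directory as the only argument; every
path it prints is one map file that would need to be read to collect the
outlinks for that segment.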
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com