Thanks, it works perfectly, although I did end up merging the segments
rather than using your MapFileReader.
On 8/15/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
John Casey wrote:
> Hi All, is there any way to extract the outlinks of a particular
> webpage/URL?
> I have had a look at the LinkDBReader but this will only give me a
> listing of pages that link to the page in question. Any ideas?
Nutch doesn't store this information in a single central place; rather,
it's spread inside each segment.
> I have been having a look in the segments directory and have been trying
> to read/parse the files using Hadoop's SequenceFile.Reader but haven't
> had much luck getting the format right. Is there any documentation on
> this? My intuition tells me that Nutch probably does store the outlinks
> of a URL somewhere but it's hard to tell where.
Your intuition is correct - inside a fetched and parsed segment there is
a map file called parse_data, which contains tuples <url, Parse>. Inside
each Parse.parseData you can find a list of outlinks.
(Actually, if you ran your fetcher in non-local mode, using several
reduce tasks, you will end up with several part-xxxxx map files.) You can
read these using a MapFileReader - run it without args to see the synopsis
(note: this class is only available here:
http://issues.apache.org/jira/browse/HADOOP-175).
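
To make the several-part-files case concrete, here is a rough sketch in plain
Java. It does not use the Nutch or Hadoop classes: each part-xxxxx map file is
modelled as a hypothetical in-memory sorted map from URL to its outlinks, and
the class name, method names and URLs are all made up for illustration. It
assumes each URL ends up in exactly one part, so the lookup just probes every
part in turn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class OutlinkLookup {

    // Hypothetical stand-in for one part-xxxxx map file under parse_data:
    // a map from page URL to that page's outlink URLs, sorted by key.
    public static SortedMap<String, List<String>> newPart() {
        return new TreeMap<String, List<String>>();
    }

    // Probe each part in turn and return the outlinks of the first match.
    // Assumption: a given URL is stored in exactly one part.
    public static List<String> outlinksOf(
            String url, List<SortedMap<String, List<String>>> parts) {
        for (SortedMap<String, List<String>> part : parts) {
            List<String> links = part.get(url);
            if (links != null) {
                return links;
            }
        }
        return Collections.emptyList(); // URL not present in any part
    }

    // Two example parts with made-up URLs, as if produced by two reduce tasks.
    public static List<SortedMap<String, List<String>>> exampleParts() {
        SortedMap<String, List<String>> part0 = newPart();
        part0.put("http://example.com/a",
                Arrays.asList("http://example.com/b", "http://example.org/"));

        SortedMap<String, List<String>> part1 = newPart();
        part1.put("http://example.com/b",
                Arrays.asList("http://example.com/a"));

        List<SortedMap<String, List<String>>> parts =
                new ArrayList<SortedMap<String, List<String>>>();
        parts.add(part0);
        parts.add(part1);
        return parts;
    }

    public static void main(String[] args) {
        // Print the outlinks recorded for one page, whichever part holds it.
        for (String link : outlinksOf("http://example.com/b", exampleParts())) {
            System.out.println(link);
        }
    }
}
```

With real segments the per-part lookup would go through the map file reader
rather than an in-memory map, but the probing logic would have the same shape.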
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general