Thanks, it works perfectly, although I did end up merging the segments rather than using your MapFileReader.
On 8/15/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
John Casey wrote:
> Hi All,
> is there any way to extract the outlinks of a particular webpage/URL?
> I have had a look at the LinkDBReader but this will only give me a
> listing of pages that link to the page in question. Any ideas?

Nutch doesn't store this information in a single central place; rather, it's spread inside each segment.

> I have been having a look in the segments directory and have been
> trying to read/parse the files using Hadoop's SequenceFile.Reader but
> haven't had much luck getting the format right. Is there any
> documentation on this? My intuition tells me that nutch probably does
> store the outlinks of a URL somewhere but it's hard to tell where.

Your intuition is correct - inside a fetched and parsed segment there is a map file called parse_data, which contains tuples <url, Parse>. Inside each Parse.parseData you can find a list of outlinks. (Actually, if you ran your fetcher in non-local mode, using several reduce tasks, you will end up with several part-xxxxx map files.)

You can read these using a MapFileReader - run it without args to see the synopsis (note: this class is only available here: http://issues.apache.org/jira/browse/HADOOP-175)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
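
For anyone reading this thread later, the point about several part-xxxxx map files can be sketched roughly as follows. This is only a model under stated assumptions: OutlinkLookupSketch, sampleParts and findOutlinks are hypothetical names, the URLs are made up, and plain sorted in-memory maps stand in for the real map files, so nothing here is the Nutch or Hadoop API. The idea it illustrates is simply that, without knowing which part a URL was written to, you may have to probe each part in turn.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of a segment's parse_data directory. Each TreeMap plays
// the role of one part-xxxxx map file: sorted keys (URLs) mapped to the
// outlinks recorded for that page. Not the real Nutch/Hadoop classes.
class OutlinkLookupSketch {

    // Made-up sample data: two "parts", as if two reduce tasks had run.
    static List<TreeMap<String, List<String>>> sampleParts() {
        TreeMap<String, List<String>> part0 = new TreeMap<>();
        part0.put("http://example.com/a",
                Arrays.asList("http://example.com/b", "http://example.org/"));

        TreeMap<String, List<String>> part1 = new TreeMap<>();
        part1.put("http://example.com/b",
                Arrays.asList("http://example.com/a"));

        return Arrays.asList(part0, part1);
    }

    // Probe every part for the URL, since this sketch does not assume we
    // know which part a given key ended up in.
    static List<String> findOutlinks(List<TreeMap<String, List<String>>> parts,
                                     String url) {
        for (TreeMap<String, List<String>> part : parts) {
            List<String> outlinks = part.get(url);
            if (outlinks != null) {
                return outlinks;
            }
        }
        // URL was not found in any part of this (model) segment.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> outlinks = findOutlinks(sampleParts(), "http://example.com/a");
        for (String target : outlinks) {
            System.out.println(target);
        }
    }
}
```

Merging the segments first, as mentioned at the top of the thread, does not remove the need to know where the data lives, but it can leave fewer places to look.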
