John Casey wrote:
> Hi All, is there any way to extract the outlinks of a particular
> webpage/URL?
> I have had a look at the LinkDBReader but this will only give me a
> listing of pages that link to the page in question. Any ideas?

Nutch doesn't store this information in a single central place; rather, it's spread across the segments.

> I have been having a look in the segments directory and have been
> trying to read/parse the files using Hadoop's SequenceFile.Reader but
> haven't had much luck getting the format right. Is there any
> documentation on this? My intuition tells me that Nutch probably does
> store the outlinks of a URL somewhere but it's hard to tell where.

Your intuition is correct - inside a fetched and parsed segment there is a map file called parse_data, which contains tuples <url, Parse>. Inside each Parse.parseData you can find a list of outlinks. (Actually, if you ran your fetcher in non-local mode, using several reduce tasks, you will end up with several part-xxxxx map files.)

You can read these using a MapFileReader - run it without args to see the synopsis. Note: this class is only available here: http://issues.apache.org/jira/browse/HADOOP-175

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
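
Since the outlinks are spread over every segment, and possibly over several part-xxxxx map files per segment, a first step is simply to find all of those map files. The following is a minimal sketch of that step only. It assumes the segments live on the local filesystem (not on a distributed filesystem), that each segment keeps its map files under <segment>/parse_data/, and that the part names consist of "part-" followed by digits. The class and method names (ParseDataLocator, isPartName, findParseDataParts) are made up for illustration and are not part of Nutch or Hadoop. Reading the <url, Parse> records out of each directory still requires a map file reader as described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop: locates the parse_data
// map files of every segment under a local segments directory.
class ParseDataLocator {

    // Assumption: part names look like part-00000, part-00001, ...
    static boolean isPartName(String name) {
        return name.matches("part-\\d+");
    }

    // Returns every <segment>/parse_data/part-xxxxx directory, sorted by path.
    // Segments without a parse_data directory (not parsed yet) are skipped.
    static List<Path> findParseDataParts(Path segmentsDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(segmentsDir)) {
            for (Path segment : segments) {
                Path parseData = segment.resolve("parse_data");
                if (!Files.isDirectory(parseData)) {
                    continue;
                }
                try (DirectoryStream<Path> children = Files.newDirectoryStream(parseData)) {
                    for (Path child : children) {
                        // A map file is a directory, so plain files are ignored.
                        if (Files.isDirectory(child)
                                && isPartName(child.getFileName().toString())) {
                            parts.add(child);
                        }
                    }
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: ParseDataLocator <segments dir>");
            return;
        }
        for (Path part : findParseDataParts(Paths.get(args[0]))) {
            System.out.println(part);
        }
    }
}
```

Each path it prints is one map file directory that can then be handed to the reader, so a page's outlinks can be looked up in every segment that fetched it.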
