Brian Higgins wrote:
> Hi,
> I'm pretty new to Nutch and I'm trying to modify the code so that it stores
> the words before and after a hyperlink as well as the anchor text.
> I've been looking through the Nutch code for a couple of days and I'm still
> a little unclear as to the layout...
> Nutch parses incoming webpages in HTMLParser.java, right? I can't seem to
> find the code in there for URL processing though. Where exactly does it
> parse the anchor text and write it to the database?

It collects outlinks in DOMContentUtils.getOutlinks. You will need to get
the preceding sibling nodes, or a parent node, to collect more of the
surrounding text.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
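
To illustrate the sibling-node suggestion above, here is a minimal sketch that uses only the standard org.w3c.dom API from the JDK. It is not Nutch code and does not reproduce what DOMContentUtils.getOutlinks does; the class AnchorContext and the helpers textBefore, textAfter and normalize are hypothetical names made up for this example, and the sample markup is a well-formed fragment so that the JDK's XML parser can read it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: gathers the text around an <a> element.
final class AnchorContext {

    // Parse a well-formed XHTML fragment into a DOM tree (for illustration only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Only text and element siblings contribute; comments and the like are skipped.
    static boolean carriesText(Node n) {
        short type = n.getNodeType();
        return type == Node.TEXT_NODE
                || type == Node.CDATA_SECTION_NODE
                || type == Node.ELEMENT_NODE;
    }

    // Text of every sibling that precedes the anchor within its parent, in document order.
    static String textBefore(Node anchor) {
        StringBuilder sb = new StringBuilder();
        for (Node n = anchor.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
            if (carriesText(n)) {
                sb.insert(0, n.getTextContent());
            }
        }
        return normalize(sb.toString());
    }

    // Text of every sibling that follows the anchor within its parent.
    static String textAfter(Node anchor) {
        StringBuilder sb = new StringBuilder();
        for (Node n = anchor.getNextSibling(); n != null; n = n.getNextSibling()) {
            if (carriesText(n)) {
                sb.append(n.getTextContent());
            }
        }
        return normalize(sb.toString());
    }

    // Trim and collapse runs of whitespace to single spaces.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<p>Read the <a href=\"http://example.com/\">user guide</a> before crawling.</p>");
        Element a = (Element) doc.getElementsByTagName("a").item(0);
        System.out.println("before: " + textBefore(a));
        System.out.println("anchor: " + normalize(a.getTextContent()));
        System.out.println("after:  " + textAfter(a));
    }
}
```

If the anchor has no siblings (for example when it is the only child of a list item), both helpers return an empty string; in that case one could fall back to the parent node, as suggested above, and run the same sibling walk one level up.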
