Thank you for the hint. How can this be done with the Segment Reader (Nutch 0.9 API)? Thanks in advance.
Cheers,
MyD

vishal vachhani wrote:
>
> A simple solution would be to dump the segments using the following
> command and then write a script that extracts the outlinks present in
> the documents of the segment.
>
> $NUTCH_home/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch
> -nogenerate -noparse -noparsetext
>
> This will give you a dump file. Run a script on it and you will get the
> outlinks.
>
> On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:
>
>>
>> Hi @ all,
>>
>> I started to write my own plugin. I implemented the HtmlParseFilter to
>> grab outlinks to other pages, but it looks like the outlinks are just
>> links to CSS or JS files, or am I wrong? What is the best way to extract
>> all outlinks that point to a URL outside the domain MY.DOMAIN.NAME? You
>> will find my code below...
>>
>> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter {
>>
>>   private static final Log LOG =
>>       LogFactory.getLog(ComputerScienceConferenceHtmlParser.class.getName());
>>
>>   private Configuration conf;
>>
>>   public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
>>       DocumentFragment doc) {
>>
>>     ParseData parseData = parse.getData();
>>     Outlink[] outlinks = parseData.getOutlinks();
>>
>>     String text = parse.getText();
>>
>>     LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>>
>>     LOG.warn("BEFORE");
>>     for (int i = 0; i < outlinks.length; i++) {
>>       LOG.warn("Content Base URL: " + content.getBaseUrl());
>>       LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>>       LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>>       LOG.warn("Outlinks toString(): " + outlinks[i].toString());
>>       // getBaseHref() can be null when the page has no <base> tag, so
>>       // concatenate it directly instead of calling toString() on it.
>>       LOG.warn("metaTags: " + metaTags.getBaseHref());
>>     }
>>     LOG.warn("AFTER");
>>
>>     return parse;
>>   }
>>
>>   public void setConf(Configuration conf) {
>>     this.conf = conf;
>>   }
>>
>>   public Configuration getConf() {
>>     return this.conf;
>>   }
>> }
>> --
>> View this message in context:
>> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>

--
View this message in context: http://www.nabble.com/Pulling-out-URLs-tp22469643p22474780.html
Sent from the Nutch - User mailing list archive at Nabble.com.
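
For the "outlinks outside MY.DOMAIN.NAME" part of the question, here is a minimal sketch of the host check in plain Java. It does not use any Nutch classes and is not the Segment Reader API. It assumes the outlink URLs are already available as strings (for example from Outlink.getToUrl() in the filter above, or from lines of the readseg dump). The class name ExternalOutlinkFilter and its methods are hypothetical, made up for this illustration, and treating subdomains as part of the domain is an assumption you may want to change.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: keeps only URLs whose host lies
// outside a given domain.
class ExternalOutlinkFilter {

    // True if host equals the domain or is a subdomain of it
    // (assumption: www.my.domain.name counts as "inside" my.domain.name).
    static boolean isInDomain(String host, String domain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    // Returns the URLs that have a host and whose host is outside the domain.
    static List<String> externalOnly(List<String> urls, String domain) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            String host;
            try {
                host = new URI(url).getHost();
            } catch (URISyntaxException e) {
                continue; // skip URLs that cannot be parsed
            }
            if (host == null) {
                continue; // relative or opaque URL, no host to compare
            }
            if (!isInDomain(host, domain)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://MY.DOMAIN.NAME/style.css",
                "http://www.my.domain.name/script.js",
                "http://example.org/cfp.html");
        // prints [http://example.org/cfp.html]
        System.out.println(externalOnly(urls, "my.domain.name"));
    }
}
```

Inside the filter, the same check could be applied to each outlinks[i].getToUrl() before logging or storing it. Whether this belongs in a parse filter or in a script run over the dump file is a separate design choice.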