A simple solution would be to dump the segments using the following command, and then write a script that extracts the outlinks present in the documents of the segment:

$NUTCH_home/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch -nogenerate -noparse -noparsetext

This will give you a dump file. Run a script over it and you will get the outlinks.

On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:
>
> Hi @ all,
>
> I started to write my own plugin. I extended the HtmlParserFilter to grab
> outlinks to other pages, but it looks like that the outlinks are just links
> to css or js files, or am I wrong? What is the best way to extract all
> outlinks to a url that is not in the domain MY.DOMAIN.NAME? You will find
> my code below...
>
> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter {
>
>     private static final Log LOG =
>         LogFactory.getLog(ComputerScienceConferenceHtmlParser.class.getName());
>
>     private Configuration conf;
>
>     public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
>             DocumentFragment doc) {
>
>         ParseData parseData = parse.getData();
>         Outlink[] outlinks = parseData.getOutlinks();
>
>         String text = parse.getText();
>
>         LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>
>         LOG.warn("BEFORE");
>         for(int i=0; i<outlinks.length; i++) {
>             LOG.warn("Content Base URL: " + content.getBaseUrl());
>             LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>             LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>             LOG.warn("Outlinks toString(): " + outlinks[i].toString());
>             LOG.warn("metaTags: " + metaTags.getBaseHref().toString());
>         }
>         LOG.warn("AFTER");
>
>         return parse;
>     }
>
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
>
>     public Configuration getConf() {
>         return this.conf;
>     }
> }
> --
> View this message in context:
> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
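
For the "run a script" step mentioned above, here is a rough sketch in plain Java of what such a script could look like. It is only an illustration, not part of Nutch: the class name OutlinkExtractor and the method externalOutlinks are made up for this example. It also assumes that each outlink appears in the dump on a line of the form "outlink: toUrl: <url> anchor: <text>", so please check that against your own dump file before relying on it. It keeps only the links whose host is neither the given domain nor one of its subdomains.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
final class OutlinkExtractor {

    // ASSUMPTION: outlinks show up in the dump as
    //   "outlink: toUrl: <url> anchor: <text>"
    // Adjust the pattern if your dump file looks different.
    private static final Pattern OUTLINK =
        Pattern.compile("outlink:\\s+toUrl:\\s+(\\S+)");

    // Returns the outlink URLs whose host is outside ownDomain
    // (hosts equal to ownDomain or ending in "." + ownDomain are skipped).
    static List<String> externalOutlinks(List<String> dumpLines, String ownDomain) {
        String domain = ownDomain.toLowerCase(Locale.ROOT);
        List<String> external = new ArrayList<>();
        for (String line : dumpLines) {
            Matcher m = OUTLINK.matcher(line);
            if (!m.find()) {
                continue; // not an outlink line
            }
            String url = m.group(1);
            String host = hostOf(url);
            if (host == null) {
                continue; // unparsable URL or no host part
            }
            if (!host.equals(domain) && !host.endsWith("." + domain)) {
                external.add(url);
            }
        }
        return external;
    }

    // Lower-cased host of the URL, or null if it cannot be determined.
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Usage: java OutlinkExtractor <dumpFile> <ownDomain>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OutlinkExtractor <dumpFile> <ownDomain>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : externalOutlinks(lines, args[1])) {
            System.out.println(url);
        }
    }
}
```

You would run it with the path to the dump file and your own domain as the two arguments. The same host check could also go directly into the filter() method of the plugin, applied to outlinks[i].getToUrl().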