Hi,

try this command: bin/nutch readseg -dump <segment_dir> <output>
(e.g. bin/nutch readseg -dump ./crawldir/segments/* output.log)
Regards,
sanjshra

MyD wrote:
>
> Thank you for the hint. How can this be done with the Segment Reader
> (Nutch 0.9 API)? Thanks in advance.
>
> Cheers,
> MyD
>
>
> vishal vachhani wrote:
>>
>> A simple solution would be to dump the segments using the following
>> command and then write a script that extracts the outlinks present in
>> the documents of the segment.
>>
>> $NUTCH_home/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch
>> -nogenerate -noparse -noparsetext
>>
>> This will give you a dump file. Run a script over it and you will get
>> the outlinks.
>>
>> On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:
>>
>>>
>>> Hi @ all,
>>>
>>> I started to write my own plugin. I extended the HtmlParseFilter to
>>> grab outlinks to other pages, but it looks like the outlinks are just
>>> links to CSS or JS files, or am I wrong? What is the best way to
>>> extract all outlinks to a URL that is not in the domain MY.DOMAIN.NAME?
>>> You will find my code below...
>>>
>>> import org.apache.commons.logging.Log;
>>> import org.apache.commons.logging.LogFactory;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.nutch.parse.HTMLMetaTags;
>>> import org.apache.nutch.parse.HtmlParseFilter;
>>> import org.apache.nutch.parse.Outlink;
>>> import org.apache.nutch.parse.Parse;
>>> import org.apache.nutch.parse.ParseData;
>>> import org.apache.nutch.protocol.Content;
>>> import org.w3c.dom.DocumentFragment;
>>>
>>> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter {
>>>
>>>   private static final Log LOG =
>>>       LogFactory.getLog(ComputerScienceConferenceHtmlParser.class.getName());
>>>
>>>   private Configuration conf;
>>>
>>>   public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
>>>       DocumentFragment doc) {
>>>
>>>     // Outlinks already extracted by the HTML parser for this page.
>>>     ParseData parseData = parse.getData();
>>>     Outlink[] outlinks = parseData.getOutlinks();
>>>
>>>     String text = parse.getText();
>>>
>>>     LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>>>
>>>     LOG.warn("BEFORE");
>>>     for (int i = 0; i < outlinks.length; i++) {
>>>       LOG.warn("Content Base URL: " + content.getBaseUrl());
>>>       LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>>>       LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>>>       LOG.warn("Outlinks toString(): " + outlinks[i].toString());
>>>       // getBaseHref() may be null when the page has no <base> tag, so
>>>       // rely on string concatenation instead of calling toString().
>>>       LOG.warn("metaTags: " + metaTags.getBaseHref());
>>>     }
>>>     LOG.warn("AFTER");
>>>
>>>     // The parse is returned unchanged; this filter only logs.
>>>     return parse;
>>>   }
>>>
>>>   public void setConf(Configuration conf) {
>>>     this.conf = conf;
>>>   }
>>>
>>>   public Configuration getConf() {
>>>     return this.conf;
>>>   }
>>> }
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/Pulling-out-URLs-tp22469643p22475608.html
Sent from the Nutch - User mailing list archive at Nabble.com.
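
For the original question (keeping only the outlinks that point outside MY.DOMAIN.NAME), the filtering step suggested in the thread could look roughly like the sketch below. It is plain Java with no Nutch dependency, so it can be run over the lines of a readseg dump. The class name ExternalOutlinks and its two methods are made up for illustration, and the dump line shape "outlink: toUrl: <url> anchor: <text>" is an assumption about the dump output, so check it against your own dump file before relying on the pattern.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class ExternalOutlinks {

    // ASSUMPTION: outlink lines in the dump look like
    //   "  outlink: toUrl: <url> anchor: <text>"
    private static final Pattern TO_URL = Pattern.compile("toUrl:\\s*(\\S+)");

    /** True if the URL's host is neither the domain itself nor one of its subdomains. */
    static boolean isExternal(String url, String domain) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparsable URL: treat as not external and skip it
        }
        if (host == null) {
            return false; // relative or opaque URL, no host to compare
        }
        host = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return !(host.equals(d) || host.endsWith("." + d));
    }

    /** Collects the toUrl values of all dump lines that point outside the domain. */
    static List<String> extractExternal(List<String> dumpLines, String domain) {
        List<String> result = new ArrayList<>();
        for (String line : dumpLines) {
            Matcher m = TO_URL.matcher(line);
            if (m.find() && isExternal(m.group(1), domain)) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed dump format.
        List<String> lines = Arrays.asList(
            "  outlink: toUrl: http://my.domain.name/style.css anchor: ",
            "  outlink: toUrl: http://www.example.org/cfp.html anchor: Call for papers",
            "Recno:: 0");
        // prints [http://www.example.org/cfp.html]
        System.out.println(extractExternal(lines, "my.domain.name"));
    }
}
```

The same isExternal check could also be called from inside the filter() method on each outlinks[i].getToUrl() value, instead of post-processing the dump.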