Re: Pulling out URLs

vishal vachhani Thu, 12 Mar 2009 03:15:16 -0700

A simple solution is to dump the segments with the following command and then
write a script that extracts the outlinks recorded for each document in the
segment.


$NUTCH_home/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch \
    -nogenerate -noparse -noparsetext

This will give you a dump file. Run a script over it and you will get the outlinks.
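
As a rough sketch of such a script, here it is in Java, since that is what the plugin is written in. It assumes that each outlink appears in the dump on its own line in the form "  outlink: toUrl: <url> anchor: <text>"; please check that against your own dump file, because I am writing the format from memory. The class name DumpOutlinks and the method externalOutlinks are made up for this example and are not part of Nutch. The sketch keeps only the outlinks whose host is neither the given domain nor one of its subdomains.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class DumpOutlinks {

    // ASSUMPTION: outlink lines in the dump look like
    //   "  outlink: toUrl: <url> anchor: <text>"
    private static final Pattern OUTLINK =
            Pattern.compile("^\\s*outlink: toUrl: (\\S+) anchor:.*$");

    /** Returns the outlink URLs whose host is neither `domain` nor a subdomain of it. */
    static List<String> externalOutlinks(List<String> dumpLines, String domain) {
        String wanted = domain.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String line : dumpLines) {
            Matcher m = OUTLINK.matcher(line);
            if (!m.matches()) {
                continue; // not an outlink line
            }
            String url = m.group(1);
            String host = hostOf(url);
            if (host == null) {
                continue; // skip URLs without a parseable host
            }
            if (!host.equals(wanted) && !host.endsWith("." + wanted)) {
                result.add(url);
            }
        }
        return result;
    }

    // Lower-cased host of the URL, or null if it cannot be parsed.
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the dump file, args[1]: domain to exclude
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String url : externalOutlinks(lines, args[1])) {
            System.out.println(url);
        }
    }
}
```

You would run it as "java DumpOutlinks <path to dump file> MY.DOMAIN.NAME" and it prints one external URL per line. Comparing hosts rather than doing a plain substring match keeps a link such as http://other.org/?ref=MY.DOMAIN.NAME from being treated as internal.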

On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:

>
> Hi @ all,
>
> I started to write my own plugin. I extended the HtmlParseFilter to grab
> outlinks to other pages, but it looks like the outlinks are just links
> to css or js files, or am I wrong? What is the best way to extract all
> outlinks to a URL that is not in the domain MY.DOMAIN.NAME? You will find
> my code below...
>
> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter
> {
>
>        private static final Log LOG =
>                LogFactory.getLog(ComputerScienceConferenceHtmlParser.class.getName());
>
>        private Configuration conf;
>
>        public Parse filter(Content content, Parse parse,
>                        HTMLMetaTags metaTags, DocumentFragment doc) {
>
>                ParseData parseData = parse.getData();
>                Outlink[] outlinks = parseData.getOutlinks();
>
>                String text = parse.getText();
>
>                LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>
>                LOG.warn("BEFORE");
>                for (int i = 0; i < outlinks.length; i++) {
>                        LOG.warn("Content Base URL: " + content.getBaseUrl());
>                        LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>                        LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>                        LOG.warn("Outlinks toString(): " + outlinks[i]);
>                        // getBaseHref() may be null, so do not call toString() on it
>                        LOG.warn("metaTags: " + metaTags.getBaseHref());
>                }
>                LOG.warn("AFTER");
>
>                return parse;
>        }
>
>
>        public void setConf(Configuration conf) {
>                this.conf = conf;
>        }
>
>        public Configuration getConf() {
>                return this.conf;
>        }
> }
> --
> View this message in context:
> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
