I've checked: the relevant source is in DOMContentUtils. Anchors with
rel="nofollow" are discarded.
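
To illustrate the kind of per-anchor check I mean, here is a minimal sketch. It is not the actual DOMContentUtils code: the class and method names (NofollowSketch, isNofollow, collect, extractLinks) are made up for this example, and it assumes well-formed XHTML input so that the JDK's own XML parser can build the DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is NOT the code in DOMContentUtils.
class NofollowSketch {

  /** True if the space-separated rel value contains the token "nofollow". */
  static boolean isNofollow(String rel) {
    if (rel == null) {
      return false;
    }
    for (String token : rel.trim().split("\\s+")) {
      if ("nofollow".equalsIgnoreCase(token)) {
        return true;
      }
    }
    return false;
  }

  /** Walks the DOM and collects the href of every anchor not marked nofollow. */
  static void collect(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      Element a = (Element) node;
      String href = a.getAttribute("href"); // "" when the attribute is absent
      if (!href.isEmpty() && !isNofollow(a.getAttribute("rel"))) {
        out.add(href);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), out);
    }
  }

  /** Parses well-formed XHTML (an assumption of this sketch) and returns the kept links. */
  static List<String> extractLinks(String xhtml) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)))
          .getDocumentElement();
      List<String> out = new ArrayList<>();
      collect(root, out);
      return out;
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<a href=\"http://example.org/\">kept</a>"
        + "<a href=\"http://www.google.com\" rel=\"nofollow\">skipped</a>"
        + "</body></html>";
    System.out.println(extractLinks(page)); // prints [http://example.org/]
  }
}
```

The point is only that the decision is made per anchor, independently of any page-wide robots meta tag.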
-----Original message-----
> From:weishenyun <[email protected]>
> Sent: Thu 16-Aug-2012 11:09
> To: [email protected]
> Subject: RE: Can Nutch process rel-tag likes rel="nofollow"?
>
> Well, I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and I
> can easily find code like the following:
>
>   if (!metaTags.getNoFollow()) { // okay to follow links
>     ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
>     URL baseTag = utils.getBase(root);
>     if (LOG.isTraceEnabled()) {
>       LOG.trace("Getting links...");
>     }
>     utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
>     outlinks = l.toArray(new Outlink[l.size()]);
>     if (LOG.isTraceEnabled()) {
>       LOG.trace("found " + outlinks.length + " outlinks in " + base);
>     }
>   }
>
> But I think this code only handles nofollow or noindex in meta tags, for
> example <meta name="robots" content="nofollow"> or <meta name="robots"
> content="noindex">. Those tags control all the links on that page.
>
> But my question is about a single link on a page, such as <a
> href="http://www.google.com" rel="nofollow">. In this case, will Nutch
> discard the link because of its rel="nofollow" attribute?
> Thanks, Markus.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001582.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>