I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and I can
easily find source code like the snippet below.
    if (!metaTags.getNoFollow()) { // okay to follow links
        ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
        URL baseTag = utils.getBase(root);
        if (LOG.isTraceEnabled()) {
            LOG.trace("Getting links...");
        }
        utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
        outlinks = l.toArray(new Outlink[l.size()]);
        if (LOG.isTraceEnabled()) {
            LOG.trace("found " + outlinks.length + " outlinks in " + base);
        }
    }
But I think this code only handles nofollow or noindex in the robots meta
tags, for example <meta name="robots" content="nofollow"> or <meta
name="robots" content="noindex">. Those tags apply to the whole page, so they
control all the links on it.
My problem is different: a single link on a page, such as
<a href="http://www.google.com" rel="nofollow">. In this case, will Nutch
discard this link because of its rel="nofollow" attribute?
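To make the question concrete, here is a minimal sketch of the per-link check I have in mind. It is not Nutch code: the class RelNoFollowSketch and its methods are hypothetical names, it uses only the JDK's DOM parser instead of the Nutch/Tika classes, and it assumes the page is well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; none of these names exist in Nutch.
public class RelNoFollowSketch {

  /** Returns true if the anchor's rel attribute contains the token "nofollow". */
  static boolean hasNoFollow(Element anchor) {
    String rel = anchor.getAttribute("rel"); // empty string when rel is absent
    for (String token : rel.trim().split("\\s+")) {
      if (token.toLowerCase(Locale.ROOT).equals("nofollow")) {
        return true;
      }
    }
    return false;
  }

  /** Collects the href of every <a> element that is not marked rel="nofollow". */
  static List<String> followableLinks(Document doc) {
    List<String> links = new ArrayList<>();
    NodeList anchors = doc.getElementsByTagName("a");
    for (int i = 0; i < anchors.getLength(); i++) {
      Element a = (Element) anchors.item(i);
      if (a.hasAttribute("href") && !hasNoFollow(a)) {
        links.add(a.getAttribute("href"));
      }
    }
    return links;
  }

  /** Parses a well-formed XHTML string with the JDK's built-in DOM parser. */
  static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body>"
        + "<a href=\"http://www.google.com\" rel=\"nofollow\">skip</a>"
        + "<a href=\"http://nutch.apache.org/\">keep</a>"
        + "</body></html>";
    System.out.println(followableLinks(parse(page))); // prints [http://nutch.apache.org/]
  }
}
```

The check is made per anchor rather than per page, which is the difference from the meta robots handling quoted above.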
Thanks Markus.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.