I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and I can
easily find source code like the following:

if (!metaTags.getNoFollow()) { // okay to follow links
  ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
  URL baseTag = utils.getBase(root);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Getting links...");
  }
  utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
  outlinks = l.toArray(new Outlink[l.size()]);
  if (LOG.isTraceEnabled()) {
    LOG.trace("found " + outlinks.length + " outlinks in " + base);
  }
}

But I think this code only handles nofollow or noindex in the robots meta
tags, for example <meta name="robots" content="nofollow"> or <meta
name="robots" content="noindex">. These tags apply to the whole page,
including all of the links on it.

My problem is different: a single link on a page carries the attribute, such
as <a href="http://www.google.com" rel="nofollow">. In this case, will Nutch
discard the link because of its rel="nofollow" attribute?
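
To make the question concrete, here is a minimal sketch of the per-link check
I have in mind. It is not Nutch code: the class RelNofollowSketch and its
methods isNofollow and followableLinks are hypothetical names of my own, and
it assumes a well-formed XHTML snippet parsed with the JDK's DOM parser rather
than the utils.getOutlinks call shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; none of these names exist in Nutch.
public class RelNofollowSketch {

    // True if the rel attribute contains the token "nofollow".
    // rel may hold several space-separated tokens, compared case-insensitively.
    static boolean isNofollow(Element anchor) {
        String rel = anchor.getAttribute("rel"); // "" when the attribute is absent
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Collects the href of every <a> element that is not marked rel="nofollow".
    // Assumes the input is well-formed XML with lowercase element names.
    static List<String> followableLinks(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> links = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href") && !isNofollow(a)) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<a href=\"http://www.google.com\" rel=\"nofollow\">skipped</a>"
            + "<a href=\"http://nutch.apache.org/\">kept</a>"
            + "</body></html>";
        System.out.println(followableLinks(page)); // prints [http://nutch.apache.org/]
    }
}
```

What I would like to know is whether the outlink extraction in Nutch already
does something equivalent to this per-link test, or only the page-level check.
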
Thanks Markus. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
