RE: Can Nutch process rel-tag likes rel="nofollow"?

weishenyun Thu, 16 Aug 2012 02:07:50 -0700

Well, I have read the TikaParser.java code in Nutch 1.x and Nutch 2.0, and I
can easily find code like the snippet below.


if (!metaTags.getNoFollow()) { // okay to follow links
  ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
  URL baseTag = utils.getBase(root);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Getting links...");
  }
  utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
  outlinks = l.toArray(new Outlink[l.size()]);
  if (LOG.isTraceEnabled()) {
    LOG.trace("found " + outlinks.length + " outlinks in " + base);
  }
}

But I think this code only handles nofollow or noindex in meta tags, for
example <meta name="robots" content="nofollow"> or <meta name="robots"
content="noindex">. And these tags apply to the whole page, so they affect
every link on it.

But my problem is a single link on a page, such as
<a href="http://www.google.com" rel="nofollow">. In this case, will Nutch
discard the link because of its rel="nofollow" attribute?
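
To make the question concrete, here is a minimal sketch of the kind of
per-link check I mean. It is not Nutch code: the class RelNofollowSketch and
its methods are hypothetical names made up for illustration, it uses only the
JDK's built-in DOM parser, and it assumes the page is well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names exist in Nutch.
final class RelNofollowSketch {

  /** True if the anchor's rel attribute contains the token "nofollow". */
  static boolean hasNoFollow(Element anchor) {
    String rel = anchor.getAttribute("rel"); // empty string when absent
    for (String token : rel.trim().split("\\s+")) {
      if ("nofollow".equalsIgnoreCase(token)) {
        return true;
      }
    }
    return false;
  }

  /** Walks the DOM and collects hrefs of anchors not marked nofollow. */
  static void collectLinks(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      Element a = (Element) node;
      String href = a.getAttribute("href");
      if (!href.isEmpty() && !hasNoFollow(a)) {
        out.add(href);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectLinks(children.item(i), out);
    }
  }

  /** Parses well-formed XHTML and returns the links that may be followed. */
  static List<String> followableLinks(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    List<String> out = new ArrayList<>();
    collectLinks(doc.getDocumentElement(), out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body>"
        + "<a href=\"http://www.google.com\" rel=\"nofollow\">skip</a>"
        + "<a href=\"http://nutch.apache.org/\">keep</a>"
        + "</body></html>";
    System.out.println(followableLinks(page)); // prints [http://nutch.apache.org/]
  }
}
```

Here the decision is made per anchor, so one link marked rel="nofollow" is
dropped while the other links on the same page are kept. That is the
behaviour I am asking about, as opposed to the page-wide meta tag check.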
Thanks Markus. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Nutch-process-rel-tag-likes-rel-nofollow-tp4001541p4001582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
