Hello,

I would appreciate any guidance. In my web page I have
<!--noindex-->sometext<!--/noindex--> to stop sometext from being
indexed.

I have altered my DOMContentUtils.java file in
nutch-0.8/src/java/org/apache/nutch/parse/html

to add the following lines:

>>> = lines I have added

.....
public class DOMContentUtils {

>>>  private static boolean noindex = false;
......

....
  }
    if (node.getNodeType() == Node.COMMENT_NODE) {
>>>      String text = node.getNodeValue();
>>>      if (text.equals("noindex")) {
>>>        noindex = true;
>>>      }
>>>      if (text.equals("/noindex")) {
>>>        noindex = false;
>>>      }
      return false;
    }

    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = node.getNodeValue();
      text = text.replaceAll("\\s+", " ");
      text = text.trim();
>>>      if (text.length() > 0 && noindex == false) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }

I would assume that when we reach the first noindex comment, it flips
the switch and sets noindex to true, so the following text node is not
processed. When it hits the trailing /noindex comment, indexing is
switched back on.

But in my results it now appears to index the comments as well, e.g.:

... in advanced sculptural <!--noindex-->basketry<!--/noindex-->
methods.</p> <p><span
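
The toggle logic described above can be tried outside Nutch with a small
standalone sketch. This is not the Nutch code: the class name NoIndexSketch
and the method extractText are made up for illustration, and it uses the
JDK's XML parser on a well-formed snippet rather than Nutch's HTML parser.
It also makes two assumptions that differ from the lines above: the comment
value is trimmed before comparison (so "<!-- noindex -->" would also match),
and the flag is an instance field rather than a static one, so state does
not carry over between documents.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical standalone sketch; not part of Nutch.
public class NoIndexSketch {

  // Per-extraction state (assumption: instance fields instead of a static flag).
  private boolean noindex = false;
  private final StringBuilder sb = new StringBuilder();

  // Depth-first walk in document order, toggling the flag on marker comments.
  private void walk(Node node) {
    if (node.getNodeType() == Node.COMMENT_NODE) {
      // Trim so that surrounding whitespace in the comment does not matter.
      String marker = node.getNodeValue().trim();
      if (marker.equalsIgnoreCase("noindex")) {
        noindex = true;
      } else if (marker.equalsIgnoreCase("/noindex")) {
        noindex = false;
      }
      return; // comment text itself is never appended
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value, as in the excerpt above
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (text.length() > 0 && !noindex) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
      return;
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(c);
    }
  }

  // Parses a well-formed XML/XHTML fragment and returns the text that is kept.
  public static String extractText(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    NoIndexSketch sketch = new NoIndexSketch();
    sketch.walk(doc.getDocumentElement());
    return sketch.sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String page = "<p>in advanced sculptural "
        + "<!--noindex-->basketry<!--/noindex--> methods.</p>";
    System.out.println(extractText(page));
  }
}
```

If a sketch like this drops the marked text while the Nutch results do not,
that would suggest looking at how and when the modified class is actually
used, rather than at the toggle logic itself.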

I would really appreciate some help. If somebody could post an altered
DOMContentUtils.java file that they have working, I could use it as a
reference.

Kind regards,
Phil