Hi,

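The attached patch makes a big difference in my case. I'm trying to index some graphics-rich websites where many links are hidden inside image maps. The patch lets the parser collect and traverse links in <area href="..."> elements: it treats "area" like "a" when looking for links, and exempts it from shouldThrowAwayLink(), which would otherwise discard it because an <area> element has no child nodes.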
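
For anyone without the Nutch source at hand, here is a minimal standalone sketch of the idea. It is not the DOMContentUtils code: the class AreaLinkSketch and its methods are hypothetical names, and it assumes well-formed XHTML parsed with the JDK's XML parser, whereas Nutch builds its DOM with nekohtml. It only illustrates walking a DOM and collecting href values from both <a> and <area> elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Nutch.
public class AreaLinkSketch {

  // True for the two element names treated as link carriers here.
  static boolean isLinkElement(Node node) {
    String name = node.getNodeName();
    return "a".equalsIgnoreCase(name) || "area".equalsIgnoreCase(name);
  }

  // Depth-first walk that records every non-empty href on <a> and <area>.
  static void collectLinks(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE && isLinkElement(node)) {
      // getAttribute returns "" when the attribute is absent.
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        out.add(href);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectLinks(children.item(i), out);
    }
  }

  // Assumption: the input is well-formed XHTML, so the stock XML parser accepts it.
  static List<String> linksIn(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    List<String> out = new ArrayList<>();
    collectLinks(doc.getDocumentElement(), out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body><a href=\"/home\">Home</a>"
        + "<map name=\"nav\">"
        + "<area shape=\"rect\" coords=\"0,0,10,10\" href=\"/products\"/>"
        + "</map></body></html>";
    System.out.println(linksIn(page)); // prints [/home, /products]
  }
}
```

Note that the <area> element in the example has no children, which is why the patch has to exempt it from the "no inner structure" check that is applied to <a> elements.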

Enjoy!

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

Index: DOMContentUtils.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/util/DOMContentUtils.java,v
retrieving revision 1.3
diff -b -d -u -r1.3 DOMContentUtils.java
--- DOMContentUtils.java        17 Jul 2003 22:12:47 -0000      1.3
+++ DOMContentUtils.java        19 May 2004 11:43:28 -0000
@@ -129,6 +129,9 @@
   // of nekohtml's DOM-fixup process...
   private static boolean shouldThrowAwayLink(Node node, NodeList children, 
                                               int childLen) {
+      if (node.getNodeName().equalsIgnoreCase("area")) {
+          return false;
+      }
     if (childLen == 0) {
       // this has no inner structure 
       return true;
@@ -201,7 +204,8 @@
       childLen= children.getLength();
   
     if (node.getNodeType() == Node.ELEMENT_NODE) {
-      if ("a".equalsIgnoreCase(node.getNodeName())) {
+      if ("a".equalsIgnoreCase(node.getNodeName()) ||
+              "area".equalsIgnoreCase(node.getNodeName())) {
 
         if (shouldThrowAwayLink(node, children, childLen)) {
           // this has no inner structure or just a single nested
