Hi,
The attached patch makes a world of difference in my case. I'm trying to index some graphics-rich websites, and many of their links are hidden within image maps. With this patch, DOMContentUtils collects links from <area href="..."> elements as well as from <a> elements, so they can be traversed too.
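For anyone who wants to see the effect without applying the patch, here is a rough standalone sketch of the same idea. It is not the Nutch code: the class name AreaLinkSketch and the helpers collectLinks and linksOf are made up for illustration, it uses the JDK's XML parser instead of nekohtml (so the input has to be well-formed), and shouldThrowAwayLink is reduced to the "no children" check only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Nutch.
public class AreaLinkSketch {

  // Simplified stand-in for the patched check: <area> is never thrown
  // away, while other link elements without children are.
  static boolean shouldThrowAwayLink(Node node, int childLen) {
    if ("area".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    return childLen == 0;
  }

  // Walks the DOM in document order and records the href of every
  // <a> or <area> element that survives the check above.
  static void collectLinks(Node node, List<String> out) {
    NodeList children = node.getChildNodes();
    int childLen = children.getLength();
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      String name = node.getNodeName();
      if ("a".equalsIgnoreCase(name) || "area".equalsIgnoreCase(name)) {
        String href = ((Element) node).getAttribute("href");
        if (!href.isEmpty() && !shouldThrowAwayLink(node, childLen)) {
          out.add(href);
        }
      }
    }
    for (int i = 0; i < childLen; i++) {
      collectLinks(children.item(i), out);
    }
  }

  // Parses a well-formed markup string and returns the collected links.
  static List<String> linksOf(String markup) throws Exception {
    Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)))
        .getDocumentElement();
    List<String> out = new ArrayList<>();
    collectLinks(root, out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String page = "<body><map name=\"nav\"><area href=\"/products.html\"/></map>"
        + "<a href=\"/about.html\">About</a><a href=\"/empty.html\"></a></body>";
    System.out.println(linksOf(page)); // prints [/products.html, /about.html]
  }
}
```

The <area> element has no children, so without the early return it would be discarded along with the empty <a>.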
Enjoy!
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

Index: DOMContentUtils.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/util/DOMContentUtils.java,v
retrieving revision 1.3
diff -b -d -u -r1.3 DOMContentUtils.java
--- DOMContentUtils.java 17 Jul 2003 22:12:47 -0000 1.3
+++ DOMContentUtils.java 19 May 2004 11:43:28 -0000
@@ -129,6 +129,9 @@
   // of nekohtml's DOM-fixup process...
   private static boolean shouldThrowAwayLink(Node node, NodeList children,
                                               int childLen) {
+    if (node.getNodeName().equalsIgnoreCase("area")) {
+      return false;
+    }
     if (childLen == 0) {
       // this has no inner structure
       return true;
@@ -201,7 +204,8 @@
       childLen= children.getLength();
     if (node.getNodeType() == Node.ELEMENT_NODE) {
-      if ("a".equalsIgnoreCase(node.getNodeName())) {
+      if ("a".equalsIgnoreCase(node.getNodeName()) ||
+          "area".equalsIgnoreCase(node.getNodeName())) {
         if (shouldThrowAwayLink(node, children, childLen)) {
           // this has no inner structure or just a single nested
