Of course, it is incorrect to put the <title> tag inside the <body>, but
I have found that some web developers do this anyway.
Google, Yahoo, Bing, and many others (including browsers) can still
retrieve the contents of the <title> tag when it is placed inside the body
instead of the head.

Would it be possible to do something like the patch below, or at least to
make this behavior configurable?

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java	(revision 808307)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java	(working copy)
@@ -180,10 +180,6 @@
       Node currentNode = walker.nextNode();
       String nodeName = currentNode.getNodeName();
       short nodeType = currentNode.getNodeType();
-
-      if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
-        return false;
-      }

       if (nodeType == Node.ELEMENT_NODE) {
         if ("title".equalsIgnoreCase(nodeName)) {
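
To illustrate the configurable variant, here is a minimal standalone
sketch. It is not the actual Nutch code: it uses only the JDK's DOM classes
instead of Nutch's NodeWalker, and the class name TitleFinder, the methods
findTitle and titleOf, and the searchBody flag are all hypothetical names
made up for this example. In Nutch the flag would presumably be read from
the configuration. Note that the sketch skips the BODY subtree rather than
aborting the whole walk, which should give the same result for well-formed
pages.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical standalone sketch; not part of Nutch.
public class TitleFinder {

  /**
   * Depth-first search for the first TITLE element.
   * If searchBody is false, the BODY subtree is skipped (strict mode).
   * Returns the trimmed title text, or null if no title was found.
   */
  static String findTitle(Node node, boolean searchBody) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      String name = node.getNodeName();
      if (!searchBody && "body".equalsIgnoreCase(name)) {
        return null; // strict mode: do not descend into BODY
      }
      if ("title".equalsIgnoreCase(name)) {
        return node.getTextContent().trim();
      }
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      String title = findTitle(child, searchBody);
      if (title != null) {
        return title;
      }
    }
    return null;
  }

  /** Parses a well-formed XHTML string and looks up its title. */
  static String titleOf(String xhtml, boolean searchBody) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
    return findTitle(doc, searchBody);
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><head></head>"
        + "<body><title>Misplaced</title><p>text</p></body></html>";
    System.out.println(titleOf(page, false)); // strict: BODY is skipped
    System.out.println(titleOf(page, true));  // lenient: BODY is searched
  }
}
```

With a flag like this, the current behavior could stay the default and the
lenient lookup would only be enabled by those who want it.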

-- 
Alexey Torochkov
