Of course, it's incorrect to put the <title> tag inside the <body>, but I discovered that some web developers do this anyway. Google, Yahoo, Bing (and many others, including browsers) can still retrieve the contents of the <title> tag when it's placed inside the body instead of the head.
Is it possible to do something like this? Or at least make it configurable?

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java (revision 808307)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java (working copy)
@@ -180,10 +180,6 @@
       Node currentNode = walker.nextNode();
       String nodeName = currentNode.getNodeName();
       short nodeType = currentNode.getNodeType();
-
-      if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
-        return false;
-      }
 
       if (nodeType == Node.ELEMENT_NODE) {
         if ("title".equalsIgnoreCase(nodeName)) {

-- Alexey Torochkov
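
P.S. To illustrate the "configurable" variant, here is a rough standalone sketch of the idea. It is not Nutch code: the class TitleLookupSketch, the method findTitle and the flag stopAtBody are hypothetical names, and it uses a plain stack-based DOM walk over a JAXP-parsed document rather than Nutch's own node walker. How such a flag would be wired to a Nutch configuration property is left open.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TitleLookupSketch {

  /**
   * Walks the tree in document order and returns the text of the first
   * title element, or null if none is found.
   * stopAtBody (hypothetical flag): when true, the walk gives up as soon as
   * it reaches body, like the check that the patch removes.
   */
  static String findTitle(Node root, boolean stopAtBody) {
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node current = stack.pop();
      String name = current.getNodeName();
      if (stopAtBody && "body".equalsIgnoreCase(name)) {
        return null; // stop after HEAD
      }
      if (current.getNodeType() == Node.ELEMENT_NODE
          && "title".equalsIgnoreCase(name)) {
        return current.getTextContent().trim();
      }
      // Push children last-to-first so they are popped in document order.
      for (Node child = current.getLastChild(); child != null;
          child = child.getPreviousSibling()) {
        stack.push(child);
      }
    }
    return null;
  }

  /** Parses a well-formed XHTML string; real-world HTML needs a lenient parser. */
  static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse(
        "<html><head></head><body><title>Misplaced</title><p>text</p></body></html>");
    System.out.println(findTitle(doc, true));  // prints "null"
    System.out.println(findTitle(doc, false)); // prints "Misplaced"
  }
}
```

With the flag set to true the behaviour matches the current check (a title inside the body is ignored); with false it matches the patch above. A title placed correctly in the head is found either way.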