The problem is that it may slow down performance, although I don't think so.
Another problem is the web browser: will it show such a title in the browser title bar, when the user bookmarks the page, etc.? And what did the webmaster have in mind when he/she put a <title> tag after <body>? Can it contain an embedded table? Most probably it is the responsibility of a good HTML parser to correct the HTML on a best-effort basis.

From: Alexey Torochkov [mailto:all.net...@gmail.com]
Sent: August-28-09 10:39 AM
To: nutch-dev@lucene.apache.org
Subject: Title inside body

Of course, it's absolutely incorrect to put the <title> tag inside the <body>, but I discovered that some web developers do this. Google, Yahoo, Bing (and many others, including browsers) can retrieve the contents of the <title> tag if it's placed inside the body instead of the head.

Is it possible to do something like this? Or at least make it configurable?

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java (revision 808307)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java (working copy)
@@ -180,10 +180,6 @@
       Node currentNode = walker.nextNode();
       String nodeName = currentNode.getNodeName();
       short nodeType = currentNode.getNodeType();
-
-      if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
-        return false;
-      }
 
       if (nodeType == Node.ELEMENT_NODE) {
         if ("title".equalsIgnoreCase(nodeName)) {

-- 
Alexey Torochkov
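
For anyone who wants to try the "make it configurable" idea outside of Nutch, here is a rough, standalone sketch of what the patch changes. It is not the DOMContentUtils code: the class TitleFinder, the methods findTitle and parse, and the stopAtBody flag are all made up for illustration, the walk uses a plain stack instead of Nutch's NodeWalker, and the input is well-formed XHTML parsed with the JDK's XML parser rather than a real-world HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TitleFinder {

  /**
   * Returns the text of the first title element in document order, or null.
   * stopAtBody is a hypothetical switch: true mimics the old behaviour
   * (give up on reaching body), false mimics the patched behaviour.
   */
  public static String findTitle(Node root, boolean stopAtBody) {
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node current = stack.pop();
      String nodeName = current.getNodeName();

      if (stopAtBody && "body".equalsIgnoreCase(nodeName)) {
        return null; // stop after HEAD
      }
      if (current.getNodeType() == Node.ELEMENT_NODE
          && "title".equalsIgnoreCase(nodeName)) {
        return current.getTextContent().trim();
      }
      // Push children in reverse so they are visited in document order.
      NodeList children = current.getChildNodes();
      for (int i = children.getLength() - 1; i >= 0; i--) {
        stack.push(children.item(i));
      }
    }
    return null;
  }

  /** Parses well-formed XHTML only; an assumption made to keep the sketch small. */
  public static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    Document doc = parse(
        "<html><head></head><body><title>In body</title><p>text</p></body></html>");
    System.out.println(findTitle(doc, true));  // old behaviour: no title found
    System.out.println(findTitle(doc, false)); // patched behaviour: title in body
  }
}
```

With the flag set to false the whole document may be walked when no title exists at all, which is where the performance concern above would come from; a page with the title in the head returns just as early in both modes.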