One concern is that this may slow down parsing, although I don't think it
would.

Another concern is browser behaviour: will a browser show such a title in its
title bar, or use it when the user bookmarks the page? And what did the
webmaster have in mind when he/she put a <title> tag after <body>? Can it
contain an embedded table?

Most probably it is the responsibility of a good HTML parser to correct
such HTML on a best-effort basis.

From: Alexey Torochkov [mailto:all.net...@gmail.com] 
Sent: August-28-09 10:39 AM
To: nutch-dev@lucene.apache.org
Subject: Title inside body

 

Of course, it's absolutely incorrect to put the <title> tag inside the
<body>, but I discovered that some web developers do this.
Google, Yahoo, Bing (and many others, including browsers) can retrieve the
contents of the <title> tag if it's placed inside the body instead of the
head.

 

Is it possible to do something like this, or at least make it configurable?

 

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java         (revision 808307)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java      (working copy)
@@ -180,10 +180,6 @@
       Node currentNode = walker.nextNode();
       String nodeName = currentNode.getNodeName();
       short nodeType = currentNode.getNodeType();
-      
-      if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
-        return false;
-      }
   
       if (nodeType == Node.ELEMENT_NODE) {
         if ("title".equalsIgnoreCase(nodeName)) {
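
To illustrate the "make it configurable" idea, here is a minimal standalone
sketch, not the actual Nutch code. The class TitleLookupSketch, its methods
and the searchBody flag are hypothetical names made up for this example, and
it uses the JDK XML parser on well-formed markup only to keep the example
self-contained. How such a flag would be wired into the Nutch configuration
is left open.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; none of these names exist in Nutch.
final class TitleLookupSketch {

  /** Parses well-formed markup with the JDK XML parser (for the demo only). */
  static Document parse(String xml) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(xml)));
  }

  /**
   * Depth-first search for the first title element.
   * searchBody is an assumed switch: when false, the body subtree is
   * skipped (similar in effect to the "stop after HEAD" check); when true,
   * a title placed inside the body is found as well.
   */
  static String findTitle(Node node, boolean searchBody) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      String name = node.getNodeName();
      if ("title".equalsIgnoreCase(name)) {
        return node.getTextContent().trim();
      }
      if (!searchBody && "body".equalsIgnoreCase(name)) {
        return null; // do not descend into the body
      }
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      String title = findTitle(child, searchBody);
      if (title != null) {
        return title;
      }
    }
    return null;
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse(
        "<html><head></head><body><title>Misplaced</title><p>text</p></body></html>");
    System.out.println(findTitle(doc, false)); // prints "null"
    System.out.println(findTitle(doc, true));  // prints "Misplaced"
  }
}
```

With the flag off, the walk never enters the body, so the cost is the same as
with the current check; with it on, the worst case is a walk over the whole
document when no title exists at all.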


-- 
Alexey Torochkov 
