Not so easy with RegEx: what if the page has <title   > instead of <title>, or,
for instance,

<title

>My &amp; Title </p><tit

Le>

 

-    Neko HTML will survive; RegEx won’t...
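
To make the point concrete, here is a minimal sketch (not Nutch code) of how a strict pattern behaves on such input. The class name TitleRegexDemo and the method extract() are made up for illustration, and the pattern is only in the spirit of the one proposed further down in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class TitleRegexDemo {

    // Strict pattern: the opening tag must be written exactly as <title>
    // (any letter case); DOTALL lets the content span several lines.
    static final Pattern STRICT = Pattern.compile(
            "<title>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw text between the tags, or null when nothing matches.
    static String extract(String html) {
        Matcher m = STRICT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Clean markup: the pattern matches.
        System.out.println(extract("<title>My Title</title>"));
        // Whitespace inside the opening tag: no match, extract() returns null.
        System.out.println(extract("<title   >My Title</title>"));
        // Tag split across lines and never closed properly: no match either.
        System.out.println(extract("<title\n>My &amp; Title </p><tit\nLe>"));
    }
}
```

Even when the pattern does match, it returns the raw markup text, so an entity such as &amp; comes back undecoded.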

 

 

DOMContentUtils has a method getTitle()...

 

Alexey suggested reviewing this:

      if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
        return false;
      }

 

It does not work if <title> comes after <body>: the walk stops at <body> and
never reaches the title...
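
A rough sketch of the alternative, assuming we simply keep walking the DOM until the first <title> element turns up, wherever it sits. This is not the actual DOMContentUtils code: the class TitleAnywhere and its methods findTitle() and parse() are hypothetical, and the JDK XML parser stands in for the HTML parser, so the sample input has to be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Nutch.
class TitleAnywhere {

    // Depth-first search for the first <title> element.
    // Unlike the check quoted above, it does not give up at <body>.
    public static String findTitle(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return node.getTextContent().trim();
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String title = findTitle(children.item(i));
            if (title != null) {
                return title;
            }
        }
        return null; // no <title> anywhere below this node
    }

    // Stand-in for the real HTML parser: plain JDK XML parsing,
    // which only accepts well-formed input.
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body>"
                + "<title>My Title</title><p>text</p></body></html>";
        System.out.println(findTitle(parse(page))); // prints "My Title"
    }
}
```

The cost is that pages without any <title> get walked in full, which the early return at <body> was presumably there to avoid.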

 

 

Note that the Neko HTML parser may automatically repair some webmaster
errors... but I am not sure whether it can move a misplaced title into <head>...

 

 

 

From: Magnús Skúlason [mailto:magg...@gmail.com] 
Sent: August-28-09 3:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Title inside body

 

Hi,

 

This should be easy; try something like

 

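      // needs java.util.regex.Pattern and java.util.regex.Matcher
      if (title.equals("")) {
             // ".*?" (not ".?") so the title may be longer than one character;
             // DOTALL lets it span several lines
             Pattern p = Pattern.compile("<title>(.*?)</title>",
                         Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
             Matcher m = p.matcher(text);
             if (m.find()) {
                         title = m.group(1).trim(); // text between the tags only
             }
      }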

 

after line 194 in HtmlParser.java

 

Best regards,

Magnus

 

On Fri, Aug 28, 2009 at 8:07 PM, Alexey Torochkov <all.net...@gmail.com>
wrote:

 

On Fri, Aug 28, 2009 at 7:39 PM, Fuad Efendi <f...@efendi.ca> wrote:

Some bad guys even put <div> before the <html> tag – check the Google cached page :-)

(just joking...)

Wonderfully, browsers understand that...

:-P

Without sarcasm and irony... I just wanted to say that if a page has a
title, it should be extracted anyway

-- 
Alexey Torochkov 

 
