It is not so easy with RegEx: what if the page has <title > instead of <title>, or, for instance,
<title >My & Title </p><tit Le>? Neko HTML will survive; RegEx won't...

DOMContentUtils has a method getTitle()... Alexei suggested reviewing this:

    if ("body".equalsIgnoreCase(nodeName)) { // stop after HEAD
      return false;
    }

It does not work if <title> is after <body>...

Note that the Neko HTML parser may automatically handle some webmaster errors... but I am not sure whether it can put the title inside <head>...

From: Magnús Skúlason [mailto:magg...@gmail.com]
Sent: August-28-09 3:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Title inside body

Hi,

This should be easy, try something like

    if (title.equals("")) {
      Pattern p = Pattern.compile("\\<title\\>.*?\\<\\/title\\>");
      Matcher m = p.matcher(text);
      if (m.find()) {
        title = m.group();
      }
    }

after line 194 in HtmlParser.java

Best regards,
Magnus

On Fri, Aug 28, 2009 at 8:07 PM, Alexey Torochkov <all.net...@gmail.com> wrote:

On Fri, Aug 28, 2009 at 7:39 PM, Fuad Efendi <f...@efendi.ca> wrote:

Some bad guys even put <div> before the <html> tag, check the Google cached page :) (just joking...)

Wonderfully, browsers understand that... :-P

Without sarcasm and irony... I just wanted to say that if a page has a title - it should be extracted anyway

--
Alexey Torochkov
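
To illustrate the <title > concern from the top of this message, here is a minimal sketch, not the Nutch code. The class TitleFallback and both patterns are hypothetical names made up for the example. The strict pattern follows the spirit of the quoted suggestion (with a capture group so the tags are not part of the result); the tolerant one allows whitespace, attributes and mixed case in the tags. Neither one is a substitute for a real parser such as Neko.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class TitleFallback {

  // Matches only the exact, lowercase "<title>" ... "</title>" form.
  private static final Pattern STRICT =
      Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);

  // Assumed, more tolerant variant: any case, optional whitespace or
  // attributes inside the opening tag, optional whitespace in the closing tag.
  private static final Pattern TOLERANT =
      Pattern.compile("<title\\b[^>]*>(.*?)</title\\s*>",
          Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  static String strictTitle(String html) {
    return firstGroup(STRICT, html);
  }

  static String tolerantTitle(String html) {
    return firstGroup(TOLERANT, html);
  }

  // Returns the trimmed text between the tags, or null when nothing matches.
  private static String firstGroup(Pattern p, String html) {
    Matcher m = p.matcher(html);
    return m.find() ? m.group(1).trim() : null;
  }
}
```

With a page like <html><body><title >My Title</title></body></html>, strictTitle() finds nothing while tolerantTitle() returns "My Title". With the broken input <title >My & Title </p><tit Le> both return null because there is no closing tag to match, which is the case where only an error-tolerant parser can help.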