Extracting title from XHTML pages

2006-12-20 Thread Michael Wechner
Hi, it seems to me that Nutch 0.8.x cannot extract the title from an XHTML page, e.g. http://www.yulup.org/
2006-12-20 14:22:22,375 INFO fetcher.Fetcher - fetching http://www.yulup.org/
2006-12-20 14:22:22,684 WARN parse.ParserFactory - ParserFactory:Plugin:

Re: Extracting title from XHTML pages

2006-12-20 Thread Sami Siren
Michael Wechner wrote: Hi, it seems to me that Nutch 0.8.x cannot extract the title from an XHTML page, e.g.
Try changing the following in your parse-plugins.xml: <mimeType name="application/xhtml+xml"> <plugin id="parse-html" /> </mimeType> This was changed in trunk

difference between intranet and internet crawling

2006-12-20 Thread Michael Wechner
Hi, there are several posts about the difference between regex-urlfilter.txt and crawl-urlfilter.txt, e.g. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06318.html or http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200503.mbox/[EMAIL PROTECTED] but it might be stupid,

[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

2006-12-20 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] Doug Cook commented on NUTCH-416: You may also want to make the status codes ORed values, so that, for example, all of the various kinds of failure all have a
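
The comment above is cut off, so what follows is only one possible reading of "ORed" status codes: a minimal Java sketch in which each specific status is built by ORing a category bit with a detail code, so a single mask test can recognize a whole family of outcomes. The class name, constant names, and bit values are all hypothetical and are not taken from Nutch's CrawlDatum.

```java
// Hypothetical sketch: names and bit values are illustrative only,
// not the constants used by Nutch's CrawlDatum.
final class StatusFlags {
    // Category bits: one bit per broad kind of outcome.
    static final int SUCCESS = 0x10;
    static final int FAILURE = 0x20;

    // Specific statuses: a category bit ORed with a detail code in the low bits.
    static final int FETCH_OK        = SUCCESS | 0x1;
    static final int FETCH_TIMEOUT   = FAILURE | 0x1;
    static final int FETCH_NOT_FOUND = FAILURE | 0x2;
    static final int PARSE_ERROR     = FAILURE | 0x3;

    private StatusFlags() {}

    // True for any status that carries the FAILURE category bit.
    static boolean isFailure(int status) {
        return (status & FAILURE) != 0;
    }

    // True for any status that carries the SUCCESS category bit.
    static boolean isSuccess(int status) {
        return (status & SUCCESS) != 0;
    }
}
```

With this layout, code that only cares whether something failed can call StatusFlags.isFailure(status) without listing every failure code, and new failure kinds can be added without touching those callers.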

Re: implement thai language indexing and search

2006-12-20 Thread sanjeev
Hello, my crawl index is not being created correctly using the new settings. Although the log shows no errors, I am not able to open it using Luke; it says index corrupt, access denied, invalid index, etc. What could be wrong? Also, the size of the index is rather small, 8 KB or so... :-( And no