Hi
It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
page, e.g.
http://www.yulup.org/
2006-12-20 14:22:22,375 INFO fetcher.Fetcher - fetching http://www.yulup.org/
2006-12-20 14:22:22,684 WARN parse.ParserFactory - ParserFactory:Plugin:
Michael Wechner wrote:
Hi
It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
page, e.g.
Try changing the following in your parse-plugins.xml
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>
This was changed in trunk
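To double-check that the mapping is in place after editing, something like the following might help. It is only a minimal sketch in Python, not part of Nutch: the `plugins_for` helper is hypothetical, and the `parse-plugins` root element in the sample is an assumption; only the `mimeType` entry itself comes from the suggestion above.

```python
import xml.etree.ElementTree as ET

# Sample fragment for illustration. The <parse-plugins> wrapper is an
# assumption; the mimeType entry is the one suggested in this thread.
SNIPPET = """
<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
"""


def plugins_for(xml_text, mime_type):
    """Return the plugin ids mapped to mime_type, in document order."""
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for mime in root.iter("mimeType")
        if mime.get("name") == mime_type
        for plugin in mime.findall("plugin")
    ]


print(plugins_for(SNIPPET, "application/xhtml+xml"))
```

An empty list for application/xhtml+xml would mean no plugin is mapped to that MIME type in the text you passed in.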
Hi
There are several posts about the difference between
regex-urlfilter.txt and crawl-urlfilter.txt,
e.g. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06318.html
or
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200503.mbox/[EMAIL PROTECTED]
but it might be stupid,
[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
-
You may also want to make the status codes ORed values, so that, for example,
all of the various kinds of failure all have a
Hello,
My crawl index is not being created correctly using the new settings.
Although the log shows no errors, I am not able to open it using Luke;
it says index corrupt, access denied, invalid index, etc.
What could be wrong? Also, the size of the index is rather small: 8 KB or
so... :-(
And no