Ken Gregoire wrote:

look here, it is blocking robots: http://ulysses.wyona.org/robots.txt


right, but shouldn't it just block the URL /foo/bar.html?

Maybe I completely misunderstand how a robots.txt should be written, or is it possible
that Nutch doesn't actually parse "Disallow" correctly?

I have also commented out the Disallow lines in http://ulysses.wyona.org/robots.txt
but get the same result, i.e. just one page crawled.

So I am not sure whether this has anything to do with the robots.txt at all.

Thanks

Michi


User-agent: *
Disallow: /foo/bar.html

User-agent: lenya
Disallow: /foo/bar.html
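For what it's worth, a conforming robots.txt parser should only block /foo/bar.html itself and leave the rest of the site crawlable. Here is a quick sanity check using Python's standard robotparser against the rules above; this shows what a spec-conforming parser does, not necessarily what Nutch's own parser does:

```python
# Sanity check: feed the robots.txt rules from the thread into Python's
# standard robots.txt parser and see which URLs it considers blocked.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /foo/bar.html

User-agent: lenya
Disallow: /foo/bar.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Only /foo/bar.html should be disallowed; the homepage and other
# pages remain fetchable for any user agent.
print(rp.can_fetch("lenya", "http://ulysses.wyona.org/foo/bar.html"))  # False
print(rp.can_fetch("lenya", "http://ulysses.wyona.org/"))              # True
print(rp.can_fetch("*", "http://ulysses.wyona.org/index.html"))        # True
```

So if Nutch stops after the homepage with these rules in place, the robots.txt itself is probably not the cause.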





Michael Wechner wrote:

Hi

I am trying to index http://ulysses.wyona.org/, but somehow it just indexes the homepage and doesn't seem to follow any links. I have set "depth 3", and other sites are being crawled deeper without a problem, but not the Ulysses page.

Has anyone had similar experiences?

Is it possible that Nutch has a problem with well-formed XHTML (application/xhtml+xml)?
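One way to narrow this down: check whether the page is actually well-formed XML and whether its links are extractable by a strict XML parser. A minimal sketch with the stdlib (the markup and the /page2.html link below are made-up test data, not taken from the real site; this is a diagnostic for the document, not a statement about Nutch's internals):

```python
# Diagnostic sketch: parse well-formed XHTML with a strict XML parser and
# extract the <a href> links. XHTML elements live in the XHTML namespace,
# which is a common reason link extraction silently finds nothing.
import xml.etree.ElementTree as ET

xhtml = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body><a href="/page2.html">next page</a></body>
</html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
root = ET.fromstring(xhtml)

# Note: searching for bare "a" would find nothing; the namespace prefix
# is required for namespaced XHTML documents.
links = [a.get("href") for a in root.iter(XHTML_NS + "a")]
print(links)  # ['/page2.html']
```

If a parser that is not namespace-aware (or that only handles tag-soup HTML) looks for plain `a` elements, it can miss every link in an application/xhtml+xml document even though the markup is perfectly valid, which would match the "homepage only, no links followed" symptom.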

Thanks

Michi




--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61
