Re: [Nutch-general] Nutch doesn't dive deeper

Michael Wechner Fri, 18 Aug 2006 04:50:06 -0700

Ken Gregoire wrote:

> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt



Right, but shouldn't it block only the URL /foo/bar.html?

Maybe I completely misunderstand how a robots.txt should be written, or is it
possible that Nutch doesn't really parse "Disallow"?

I have also commented out the Disallow rules in http://ulysses.wyona.org/robots.txt
but get the same result, i.e. just one page is crawled.

So I am not sure whether this has anything to do with the robots.txt.
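
As a quick sanity check on the rules themselves, here is a minimal sketch that
feeds the robots.txt quoted in this thread to Python's standard
urllib.robotparser. This is not Nutch's own parser, so it only shows how a
typical parser reads these rules. The agent name "nutch" and the path
/some/other/page.html are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in this thread (copied by hand, not fetched from the server).
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo/bar.html

User-agent: lenya
Disallow: /foo/bar.html
"""

BASE = "http://ulysses.wyona.org"


def allowed(agent: str, url: str) -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "nutch" is a hypothetical agent name; the real crawler's name may differ.
    for path in ("/", "/foo/bar.html", "/some/other/page.html"):
        print(path, allowed("nutch", BASE + path))
```

With this parser, only /foo/bar.html is refused and every other path stays
fetchable, which fits the suspicion that the robots.txt is not what stops the
crawl at the homepage.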

Thanks

Michi

>
> User-agent: *
> Disallow: /foo/bar.html
>
> User-agent: lenya
> Disallow: /foo/bar.html
>
> Michael Wechner wrote:
>
>> Hi
>>
>> I am trying to index http://ulysses.wyona.org/ but somehow it just 
>> indexes the homepage but doesn't seem to follow
>> any links. I have set "depth 3" and other sites are being crawled 
>> deeper without a problem but not the Ulysses page.
>>
>> Has anyone had a similar experience?
>>
>> Is it possible that Nutch has a problem with well-formed XHTML 
>> (application/xhtml+xml)?
>>
>> Thanks
>>
>> Michi
>>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


