Hmm.  I get 701 bytes from your seed, but no parseable links.  Investigating...
Karl

On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
> On 28.01.11 14.32, Karl Wright wrote:
>>
>> Thanks.  I tested my changes enough so that I was confident in
>> committing the patch, so the changes are in trunk.
>
> I'm afraid that it doesn't work properly. I downloaded the latest version
> from trunk and started the crawler.
>
> Try to use the following address in your seed list and the following rule in
> the includes list:
> ^http://ridder.uio.no/.*
>
> The following document was fetched and sent to Solr for indexing even though
> it includes a robots noindex rule:
> http://ridder.uio.no/test_closed/
>
> Here's the line from the history telling me that Solr should index it:
> 02-02-2011 16:12:33.283         document ingest (Solr)
> http://ridder.uio.no/test_closed/
>        200
>
> I can try to modify the code you have added in order to get around this
> tomorrow. I guess I can find the relevant check somewhere in the following
> folder?
> mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
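
For anyone following the thread, the two checks under discussion (does the URL match the include rule, and does the page carry a robots noindex directive) can be sketched roughly as below. This is only an illustration and not the ManifoldCF connector code: the class name RobotsMetaCheck, both method names, and the sample HTML are hypothetical, and it assumes the noindex rule is expressed as a robots meta tag, which the thread does not confirm. A real crawler would use an HTML parser rather than regular expressions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF.
public class RobotsMetaCheck {
    // Matches any <meta ...> tag, case-insensitively.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches name=robots with optional quotes.
    private static final Pattern NAME_ROBOTS =
        Pattern.compile("\\bname\\s*=\\s*[\"']?robots[\"']?", Pattern.CASE_INSENSITIVE);
    // Captures the quoted value of the content attribute.
    private static final Pattern CONTENT =
        Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns true if a robots meta tag lists "noindex" (or "none").
    public static boolean hasNoIndex(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!NAME_ROBOTS.matcher(tag).find()) {
                continue; // some other meta tag
            }
            Matcher content = CONTENT.matcher(tag);
            if (content.find()) {
                String[] tokens = content.group(1).toLowerCase(Locale.ROOT).split(",");
                for (String token : tokens) {
                    String t = token.trim();
                    if (t.equals("noindex") || t.equals("none")) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    // Returns true if the URL matches the include rule (a Java regex).
    public static boolean isIncluded(String url, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up page content, for illustration only.
        String page = "<html><head>"
            + "<meta name=\"robots\" content=\"noindex, follow\">"
            + "</head><body></body></html>";
        String url = "http://ridder.uio.no/test_closed/";
        boolean ingest = isIncluded(url, "^http://ridder.uio.no/.*") && !hasNoIndex(page);
        System.out.println("ingest " + url + "? " + ingest);
    }
}
```

Under these assumptions, a page that matches the include rule but carries a noindex directive should not be passed on to Solr, which is the behaviour Erlend expected.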
