Hmm. I get 701 bytes from your seed, but no parseable links. Investigating...

Karl

On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
> On 28.01.11 14.32, Karl Wright wrote:
>>
>> Thanks. I tested my changes enough so that I was confident in
>> committing the patch, so the changes are in trunk.
>
> I'm afraid that it doesn't work properly. I downloaded the latest version
> from trunk and started the crawler.
>
> Try to use the following address in your seed list and the following rule in
> the includes list:
> ^http://ridder.uio.no/.*
>
> The following document was fetched and sent to Solr for indexing even though
> it includes a robots noindex rule:
> http://ridder.uio.no/test_closed/
>
> Here's the line from the history telling me that Solr should index it:
> 02-02-2011 16:12:33.283 document ingest (Solr) http://ridder.uio.no/test_closed/ 200
>
> I can try to modify the code you have added in order to get around this
> tomorrow. I guess I can find the relevant check somewhere in the following
> folder?
> mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
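
For reference, the kind of robots noindex check being discussed could be sketched roughly as below. This is only an illustration under stated assumptions: the class name RobotsMetaSketch and the method hasNoIndex are hypothetical and are not taken from the ManifoldCF web connector, and a regex scan stands in for whatever HTML parsing the connector really does. It treats a page as excluded from indexing when a meta tag named "robots" carries the directive "noindex" (or "none", which is commonly read as noindex plus nofollow).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class RobotsMetaSketch {
    // Any <meta ...> tag, case-insensitive.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // A name attribute whose value is exactly "robots" (quotes optional).
    private static final Pattern NAME_ROBOTS =
        Pattern.compile("\\bname\\s*=\\s*[\"']?\\s*robots\\b", Pattern.CASE_INSENSITIVE);
    // A quoted content attribute; group 1 holds the directive list.
    private static final Pattern CONTENT_ATTR =
        Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns true if the HTML contains a robots meta tag that forbids indexing.
    static boolean hasNoIndex(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!NAME_ROBOTS.matcher(tag).find()) {
                continue; // some other meta tag, e.g. description
            }
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (!content.find()) {
                continue; // robots tag without a quoted content attribute
            }
            // Directives are comma-separated and case-insensitive.
            for (String directive : content.group(1).toLowerCase(Locale.ROOT).split(",")) {
                String d = directive.trim();
                if (d.equals("noindex") || d.equals("none")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=\"robots\" content=\"noindex, nofollow\"></head></html>";
        System.out.println(hasNoIndex(page));
    }
}
```

A crawler would call something like hasNoIndex on the fetched body before handing the document to the output connector, and skip ingestion when it returns true.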