Honestly, I haven't modified any crawler code at all. Are you sure you
entered a URL with a trailing slash in the seed list? I tried to omit
that slash, and then the crawler began to act strangely. I cannot
reproduce your results.
These are my settings:
Seed: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (marked "include only host matching
...")
Everything works like a dream. The only problem I have is with the PDF
document: the Norwegian characters are not parsed correctly. This may be
a Tika bug, since all other document formats are parsed correctly.
BTW: I did an svn update, ant clean -> build, and now the document with
the noindex rule is skipped. Great. Thanks a zillion!
And regarding the Solr trick with the jar files I had to move manually
since they were excluded from solr.jar (my last homework): when Solr is
running in a servlet container such as Resin, you have to move the
following jars manually into the <solr.home>/lib directory in order to
enable the ExtractingRequestHandler:
- apache-solr-cell-*.jar
- the other Tika jars
You will find the same information in the following file:
solr_trunk/solr/contrib/extraction/CHANGES.txt.
Erlend
On 02.02.11 17.34, Karl Wright wrote:
Turns out Java doesn't like the form of those URLs; it doesn't think they're proper:
WEB: Can't use url 'dokument.pdf' because it is badly formed: Relative path in absolute URI: http://ridder.uio.nodokument.pdf
WEB: In html document 'http://ridder.uio.no', found an unincluded URL 'dokument.pdf'
This is the relevant code, which uses the java.net.URI class:
java.net.URI parentURL = new java.net.URI(parentIdentifier);
url = parentURL.resolve(rawURL);
... and this is throwing a java.net.URISyntaxException.
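Here is a minimal standalone sketch (not the connector's actual code) of
how java.net.URI produces exactly that complaint:

  import java.net.URI;
  import java.net.URISyntaxException;

  public class ResolveDemo {
      public static void main(String[] args) throws URISyntaxException {
          // Base with no trailing slash, i.e. an empty path component.
          URI base = new URI("http://ridder.uio.no");
          // RFC 2396 merging concatenates the empty base path with the
          // relative reference, so no "/" is inserted between them:
          URI merged = base.resolve("dokument.pdf");
          System.out.println(merged); // http://ridder.uio.nodokument.pdf
          // Rebuilding that URI through the validating constructor fails,
          // because an absolute URI with an authority cannot carry a
          // relative path:
          new URI(merged.getScheme(), merged.getAuthority(), merged.getPath(),
                  merged.getQuery(), merged.getFragment());
          // -> java.net.URISyntaxException: Relative path in absolute URI
      }
  }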
I'm going to have to go look at the standards to figure out what we
should do here. Perhaps the right approach is to note the exception
and retry with a "/" glommed onto the front of the relative URL if we
get it.
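Purely as a sketch of that idea (resolveWithFallback is a made-up name,
and it only covers the relative-reference case):

  import java.net.URI;
  import java.net.URISyntaxException;

  public class UrlResolver {
      // Hypothetical fallback, not committed code.
      static URI resolveWithFallback(URI parent, String rawURL) {
          URI candidate = parent.resolve(rawURL);
          try {
              // Round-trip through the validating constructor; this is
              // where "Relative path in absolute URI" shows up.
              return new URI(candidate.getScheme(), candidate.getAuthority(),
                             candidate.getPath(), candidate.getQuery(),
                             candidate.getFragment());
          } catch (URISyntaxException e) {
              // Retry with "/" glommed onto the front of the relative URL;
              // equivalent to treating the parent's empty path as "/".
              return parent.resolve("/" + rawURL);
          }
      }
  }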
But clearly you must have modified the web connector in order to get
it to crawl your stuff in the first place.
Karl
On Wed, Feb 2, 2011 at 11:08 AM, Karl Wright <daddy...@gmail.com> wrote:
Hmm. I get 701 bytes from your seed, but no parseable links. Investigating...
Karl
On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
On 28.01.11 14.32, Karl Wright wrote:
Thanks. I tested my changes enough that I was confident in committing
the patch, so the changes are in trunk.
I'm afraid that it doesn't work properly. I downloaded the latest version
from trunk and started the crawler.
Try to use the following address in your seed list and the following rule in
the includes list:
^http://ridder.uio.no/.*
The following document was fetched and sent to Solr for indexing even though
it includes a robots noindex rule:
http://ridder.uio.no/test_closed/
Here's the line from the history telling me that Solr indexed it:
02-02-2011 16:12:33.283 document ingest (Solr) http://ridder.uio.no/test_closed/ 200
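To be explicit about what the test page contains: the rule is an
ordinary robots meta tag in the page head. A crude standalone
illustration (not the connector's actual parsing) of the kind of check
that should have stopped the ingestion:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class NoIndexCheck {
      // Crude illustration only; the connector has its own HTML handling.
      private static final Pattern META_ROBOTS = Pattern.compile(
          "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
          Pattern.CASE_INSENSITIVE);

      static boolean isNoIndex(String html) {
          Matcher m = META_ROBOTS.matcher(html);
          return m.find() && m.group(1).toLowerCase().contains("noindex");
      }

      public static void main(String[] args) {
          String page = "<html><head>"
              + "<meta name=\"robots\" content=\"noindex,nofollow\">"
              + "</head></html>";
          System.out.println(isNoIndex(page)); // true -> should not be ingested
      }
  }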
I can try to modify the code you have added in order to get around this
tomorrow. I guess I can find the relevant check somewhere in the following
folder?
mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050