I have started the Jetty server, configured the web crawler and a Solr connector, and created a job. As a first test, I tried to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds of document types (pdf, html, doc etc.).

I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log after the crawler has been started?

193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do that it apparently cannot do? Why does it fetch the same URL over and over again?
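
In case it helps with the diagnosis, the requests seem to arrive about once a minute. Below is a small Python sketch for measuring the gaps between consecutive requests, using the four access_log lines from this question as sample input. The helper name request_intervals is my own, and the script assumes the usual Apache combined log timestamp format; it only describes the symptom, not the cause.

```python
import re
from datetime import datetime

# The four access_log lines quoted in this question (sample input only).
LOG_LINES = [
    '193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
]

# Matches the bracketed timestamp, e.g. [20/Jan/2011:14:28:54 +0100].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def request_intervals(lines):
    """Return the gaps, in whole seconds, between consecutive log entries."""
    times = [
        datetime.strptime(TIMESTAMP.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")
        for line in lines
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(request_intervals(LOG_LINES))  # [60, 61, 60]
```

So the same URL is requested roughly every 60 seconds, each time with a 200 response of 588 bytes.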

2. How can I index into Solr when I don't know which fields ManifoldCF's web crawler collects? There is a field mapper in the job configuration, but the only fields I know about are the ones I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If it does not use Apache Tika, is it possible to configure the web crawler to use Tika for document parsing and language detection?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
