Indexing Solr with the web crawler

Erlend Garåsen Thu, 20 Jan 2011 06:08:58 -0800

I have started the Jetty server, configured the web crawler and a Solr connector, and created a job. As a first test, I am trying to crawl the following site:

http://ridder.uio.no/

which contains nothing but an index.html with links to documents of different types (PDF, HTML, DOC etc.).


I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log after the crawler has been started?

193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do that it apparently cannot do? Why is it fetching the same URL over and over again?
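
To put a number on how often the root URL is requested, here is a minimal Python sketch. It assumes the log entries are in the Apache combined log format, with status 200 and a size of 588 bytes as separate fields. The name fetch_intervals is only illustrative and has nothing to do with ManifoldCF itself.

```python
import re
from datetime import datetime

# The four access_log entries quoted in question 1.
LOG_LINES = [
    '193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
]

# The timestamp is the bracketed field, e.g. [20/Jan/2011:14:28:54 +0100].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def fetch_intervals(lines):
    """Return the gaps, in seconds, between consecutive requests."""
    times = [
        datetime.strptime(TIMESTAMP.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")
        for line in lines
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


intervals = fetch_intervals(LOG_LINES)
print(intervals)  # [60.0, 61.0, 60.0]
```

For the four entries above, the gaps are 60, 61 and 60 seconds, so the crawler requests "/" roughly once a minute.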

2. How can I index Solr when I don't know which fields ManifoldCF's web crawler collects? There is a field mapper in the job configuration, but I only know about the fields I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, DOC, RTF etc.? If it does not use Apache Tika, is it possible to configure the web crawler to use Tika for document parsing and language detection?


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
