Hi Erlend,

(1) The best way to find out what ManifoldCF thinks it is doing is to look at the Simple History report in the UI.
(2) The Web Connector in ManifoldCF does not have the ability, at this time, to extract links from Word docs, pdfs, etc., but Solr can extract *content* from these documents if you configure it to use Tika. The document is sent to Solr in binary form, and Tika extracts whatever metadata it can find. ManifoldCF does not get involved in that at all. Usually, setting up Solr with anonymous fields is the way to go in this case.

If this is an open site, I'll crawl it here myself momentarily and let you know what I find.

Karl

On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
>
> I have started the Jetty server, configured the web crawler, a Solr
> connector and created a job. First I try to crawl the following site:
> http://ridder.uio.no/
> which contains nothing but an index.html with links to different kinds of
> document types (pdf, html, doc etc.).
>
> I have three questions.
>
> 1. Why do I now have a lot of these lines in the above host's access_log
> after the crawler has been started?
> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
>
> What is the crawler trying to do which it probably cannot do? Why is it
> fetching the same URL over and over again?
>
> 2. How can I index Solr when I don't know which fields ManifoldCF's web
> crawler collects? There is a field mapper in the job configuration, but I
> only know about the fields I have configured in Solr's schema.xml.
>
> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
> it does not use Apache Tika, is it possible to configure the web crawler to
> use Tika for document parsing and language detection?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
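
A note on point (2) of the reply: the following is only a rough sketch of what "Solr extracts the content with Tika" can look like on the Solr side, not a description of what ManifoldCF itself sends. It assumes Solr's extracting request handler (Solr Cell) is enabled at /update/extract, that "anonymous fields" can be read as a dynamic field pattern such as attr_* in schema.xml, and that the extracted body should land in a field called "text". The base URL, the document URL, and the helper name build_extract_url are all hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, unknown_prefix="attr_", content_field="text"):
    """Build a request URL for Solr's extracting request handler.

    Assumptions (not taken from the thread): the handler is mounted at
    /update/extract, the schema has a dynamic field matching
    unknown_prefix + "*", and content_field exists in schema.xml.
    """
    params = [
        # Set the unique key explicitly; here the document URL is used as the id.
        ("literal.id", doc_id),
        # Prefix any Tika metadata field that is not in the schema, so that a
        # dynamic field pattern (e.g. attr_*) can absorb it.
        ("uprefix", unknown_prefix),
        # Map Tika's extracted body ("content") onto a schema field.
        ("fmap.content", content_field),
    ]
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical Solr location and document name, for illustration only.
    print(build_extract_url("http://localhost:8983/solr",
                            "http://ridder.uio.no/example.pdf"))
```

The binary document would then be POSTed as the request body to that URL. The point of uprefix is that you do not need to know in advance which metadata fields Tika will produce: anything unknown is prefixed and caught by the dynamic field.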