Re: Indexing Solr with the web crawler

Jack Krupansky Thu, 20 Jan 2011 06:32:03 -0800

The Solr connector is designed to send raw document content (unparsed) to Solr Cell (the ExtractingRequestHandler), which then uses Tika for MIME type detection and document parsing. If you run Tika directly, it will tell you what metadata is extracted from a particular document type, which varies.


See:
http://wiki.apache.org/solr/ExtractingRequestHandler

You can also access Solr Cell with the "Extract Only" option to see what Tika is generating within Solr Cell for a particular input document, and then use those metadata field names to construct MCF field mappings to your schema fields.


See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
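
As a rough sketch of the "Extract Only" approach, the Python below builds the request URL and lists the metadata field names found in a parsed JSON response. The Solr base URL, the /update/extract handler path, the helper names, and the sample response are all assumptions made for illustration; check your own solrconfig.xml for the real handler path, and treat the "_metadata" key convention as something to verify against your Solr version.

```python
from urllib.parse import urlencode


def extract_only_url(solr_base="http://localhost:8983/solr",
                     handler="/update/extract"):
    """Build an Extract Only URL (base URL and handler path are assumed)."""
    params = {
        "extractOnly": "true",  # parse the document but do not index it
        "wt": "json",           # ask for a JSON response
        "json.nl": "map",       # render named lists as JSON objects
    }
    return solr_base.rstrip("/") + handler + "?" + urlencode(params)


def metadata_field_names(response):
    """Collect field names from every '<name>_metadata' entry.

    Assumes the metadata is either a JSON object (json.nl=map) or a flat
    [name, values, name, values, ...] list (the default JSON rendering).
    """
    names = set()
    for key, value in response.items():
        if not key.endswith("_metadata"):
            continue
        if isinstance(value, dict):
            names.update(value.keys())
        elif isinstance(value, list):
            names.update(v for v in value[0::2] if isinstance(v, str))
    return sorted(names)


# Hypothetical response, hand-written for illustration only.
sample = {
    "responseHeader": {"status": 0},
    "doc.pdf": "...extracted text...",
    "doc.pdf_metadata": {
        "Content-Type": ["application/pdf"],
        "title": ["Example"],
    },
}

print(extract_only_url())
print(metadata_field_names(sample))
```

You would POST the document to that URL (for example with curl) and then use the names it reports as the source side of the MCF field mappings.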

-- Jack Krupansky

-----Original Message-----
From: Erlend Garåsen
Sent: Thursday, January 20, 2011 9:08 AM
To: connectors-user@incubator.apache.org
Subject: Indexing Solr with the web crawler

I have started the Jetty server, configured the web crawler, a Solr
connector and created a job. First I try to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds
of document types (pdf, html, doc etc.).

I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200
588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do which it probably cannot do? Why is it
fetching the same URL over and over again?

2. How can I index Solr when I don't know which fields ManifoldCF's web
crawler collects? There is a field mapper in the job configuration, but
I only know about the fields I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, doc, rtf etc.?
If it does not use Apache Tika, is it possible to configure the web
crawler to use Tika for document parsing and language detection?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway

Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
