The Solr connector is designed to send raw, unparsed document content to
Solr Cell (the ExtractingRequestHandler), which then uses Tika for MIME type
detection and document parsing. If you run Tika directly, it will show you
which metadata it extracts from a given document; the set of fields varies
by document type.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
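As a rough sketch of running Tika directly, the snippet below shells out to
the Tika application jar with its --metadata option and collects the field
names it reports. The jar location and the helper names (tika_metadata,
parse_metadata_lines) are assumptions made for illustration, and the output
is assumed to be one "name: value" pair per line.

```python
import subprocess


def parse_metadata_lines(text):
    """Turn 'name: value' lines into a dict, skipping lines without a pair."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' so names such as 'xmpTPg:NPages' stay whole.
        name, sep, value = line.partition(": ")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def tika_metadata(doc_path, tika_jar="tika-app.jar"):
    """Run the Tika app on one document and return its metadata as a dict.

    Assumes 'java' is on the PATH and tika_jar points at a Tika app jar.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--metadata", doc_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_metadata_lines(result.stdout)
```

Calling tika_metadata on one sample file per document type (PDF, doc, html)
and comparing the keys gives a quick picture of which fields each type
produces.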
You can also call Solr Cell with the "Extract Only" option to see what
Tika generates within Solr Cell for a particular input document, and then
use those metadata field names to construct ManifoldCF (MCF) field mappings
to your schema fields.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
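A minimal sketch of the "Extract Only" approach, assuming Solr Cell is
mounted at /update/extract and that the JSON response carries the metadata
under a key ending in "_metadata" (either as a flat name/value list or as a
map, depending on the Solr version). The function names and the way the
response is unpacked are illustrative assumptions, not part of Solr or
ManifoldCF.

```python
import json
import urllib.parse
import urllib.request


def extract_only_url(solr_base):
    """Build a Solr Cell URL that extracts without indexing."""
    params = urllib.parse.urlencode(
        {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    )
    return solr_base.rstrip("/") + "/update/extract?" + params


def metadata_field_names(response):
    """Collect metadata field names from a parsed extract-only response."""
    names = set()
    for key, value in response.items():
        if not key.endswith("_metadata"):
            continue
        if isinstance(value, dict):
            # Map form: {"name": [values], ...}
            names.update(value.keys())
        elif isinstance(value, list):
            # Flat form: ["name", [values], "name", [values], ...]
            names.update(v for v in value[0::2] if isinstance(v, str))
    return sorted(names)


def fetch_metadata_field_names(solr_base, doc_path,
                               content_type="application/octet-stream"):
    """POST one raw document to Solr Cell and return its metadata field names."""
    with open(doc_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        extract_only_url(solr_base),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as resp:
        return metadata_field_names(json.load(resp))
```

For example, fetch_metadata_field_names("http://localhost:8983/solr",
"sample.pdf") (both arguments are placeholders) would list the names that
can then be entered on the left-hand side of the job's field mappings.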
-- Jack Krupansky
-----Original Message-----
From: Erlend Garåsen
Sent: Thursday, January 20, 2011 9:08 AM
To: connectors-user@incubator.apache.org
Subject: Indexing Solr with the web crawler
I have started the Jetty server, configured the web crawler and a Solr
connector, and created a job. First I am trying to crawl the following site:
http://ridder.uio.no/
which contains nothing but an index.html with links to different kinds
of document types (pdf, html, doc etc.).
I have three questions.
1. Why do I now have a lot of these lines in the above host's access_log
after the crawler has been started?
193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
What is the crawler trying to do that it apparently cannot? Why is it
fetching the same URL over and over again?
2. How can I index into Solr when I don't know which fields ManifoldCF's
web crawler collects? There is a field mapper in the job configuration, but
I only know about the fields I have configured in Solr's schema.xml.
3. Will the web crawler parse document types such as PDF, doc, rtf etc.?
If it does not use Apache Tika, is it possible to configure the web
crawler to use Tika for document parsing and language detection?
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050