Re: Indexing Solr with the web crawler

Jack Krupansky Thu, 20 Jan 2011 07:16:26 -0800

Here's one email thread that details at least one cause of the lazy loading error:


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200910.mbox/%3c4ad5ec8c.6000...@gmail.com%3E


-- Jack Krupansky

-----Original Message-----
From: Karl Wright
Sent: Thursday, January 20, 2011 10:02 AM
To: connectors-user@incubator.apache.org
Subject: Re: Indexing Solr with the web crawler

It says:

01-20-2011 15:14:18.914   document ingest (solr_indexer)   http://ridder.uio.no/   500   588   9   lazy loading error


So what is happening is that either your Solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr, Solr returns an error.  I don't know what
Solr's "lazy loading error" is, but hopefully you can find out either
from the doc or from the Solr/Lucene newsgroup.

Thanks for clarifying. I can try to configure Solr to parse these documents.
Nutch did a good job except that it cannot detect whether a document was
modified in order to send an update/delete command to Solr. That function
is crucial for us.


That's exactly what ManifoldCF is good at.

I'm unsure about what you mean by anonymous fields in Solr. Can't I define
the fields I need in schema.xml as I want? I have created duplicate fields
for title and content in order to use different stemmers (I need to support
English and Norwegian). In Nutch there is a simple configuration file for
mapping fields from Nutch to Solr.


I'm probably using the wrong terminology.  I think they are actually
called "dynamic fields".

I haven't filled out the "expiration interval (if continuous)" field under the
scheduling folder. Is this the reason why ManifoldCF is recrawling the page
every minute?


It's retrying because the Solr connector is getting that error and
telling ManifoldCF that it should retry.  The connector does that
because it hasn't figured out that the error is due to setup, rather
than some transient condition.
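
To illustrate the distinction between a transient error and a setup error, here is a minimal Python sketch of a bounded retry policy. It is purely hypothetical and is not ManifoldCF's or the Solr connector's actual code: the function name `ingest_with_retries`, the `send` callable, and the `max_attempts` limit are all assumptions made for this example. The idea is that a 5xx response is retried a few times, and a response that never improves is then surfaced as a likely configuration problem instead of being retried forever.

```python
from typing import Callable, List


def ingest_with_retries(send: Callable[[], int], max_attempts: int = 3) -> List[int]:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable that posts one document and returns
    the HTTP status code. The list of observed status codes is returned
    so the caller can see whether the error ever cleared.
    """
    statuses: List[int] = []
    for _ in range(max_attempts):
        status = send()
        statuses.append(status)
        if status < 500:
            # Not a server error: stop retrying.
            break
    return statuses


def looks_like_setup_error(statuses: List[int]) -> bool:
    """True when every attempt failed with a 5xx status (assumed heuristic)."""
    return bool(statuses) and all(code >= 500 for code in statuses)
```

With a `send` that always returns 500, as in the history above, `looks_like_setup_error` would report True after three attempts, which is the signal to go and check the Solr configuration.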

The expiration model for continuous crawling would take more space to
describe than I have here.  I suggest you read about it in the online
end-user documentation.  If that's not enough, there's a book on the
way from Manning Publications, called ManifoldCF in Action.  Some
chapters that might help you should be available soon through the
Manning Early Access Program.

Thanks!
Karl

On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:

On 20.01.11 15.21, Karl Wright wrote:


Hi Erlend,


Hi Karl,

Thank you for replying and for your comments. It's much appreciated.

(1) The best way to find out what ManifoldCF thinks it is doing is to
look at the Simple History report in the UI.


It says:

01-20-2011 15:14:18.914   document ingest (solr_indexer)   http://ridder.uio.no/   500   588   9     lazy loading error
01-20-2011 15:14:18.800   fetch                            http://ridder.uio.no/   200   588   103
01-20-2011 15:13:18.581   document ingest (solr_indexer)   http://ridder.uio.no/   500   588   16    lazy loading error
01-20-2011 15:13:18.448   fetch                            http://ridder.uio.no/   200   588   111
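
For anyone who wants to scan a longer Simple History export for failed ingests, the rows above can be parsed mechanically. The following Python sketch is only an illustration: the field names (`result_code`, `byte_count`, `elapsed`) are assumptions about what the columns mean, inferred from the values shown, and the helper names are made up for this example. It accepts a record whether or not it has been wrapped across lines.

```python
import re
from typing import NamedTuple, Optional


class HistoryRow(NamedTuple):
    timestamp: str
    activity: str
    identifier: str
    result_code: int   # assumed: HTTP-style result code
    byte_count: int    # assumed: document size in bytes
    elapsed: int       # assumed: elapsed time for the activity
    description: str   # empty when the row has no result description


ROW = re.compile(
    r"(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<activity>.+?)\s+"
    r"(?P<identifier>https?://\S+)\s+"
    r"(?P<code>\d{3})\s+(?P<bytes>\d+)\s+(?P<ms>\d+)"
    r"(?:\s+(?P<desc>.+))?$"
)


def parse_row(text: str) -> Optional[HistoryRow]:
    """Parse one history record; wrapped lines are joined first."""
    match = ROW.match(" ".join(text.split()))
    if match is None:
        return None
    return HistoryRow(
        timestamp=match["ts"],
        activity=match["activity"],
        identifier=match["identifier"],
        result_code=int(match["code"]),
        byte_count=int(match["bytes"]),
        elapsed=int(match["ms"]),
        description=match["desc"] or "",
    )


def is_failed_ingest(row: HistoryRow) -> bool:
    """True for ingest activities that did not return a 2xx code."""
    return row.activity.startswith("document ingest") and not 200 <= row.result_code < 300


sample = """01-20-2011 15:14:18.914         document ingest (solr_indexer)
http://ridder.uio.no/
       500     588     9       lazy loading error"""
row = parse_row(sample)
if row is not None and is_failed_ingest(row):
    print(row.timestamp, row.identifier, row.result_code, row.description)
```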

(2) The Web Connector in ManifoldCF does not have the ability, at this
time, to extract links from Word docs, PDFs, etc., but Solr can
extract *content* from these documents if you configure it to use
Tika.  The document is sent to Solr in binary form, and Tika extracts
whatever metadata it can find.  ManifoldCF does not get involved in
that at all.  Usually, setting up Solr with anonymous fields is the
way to go in this case.
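
As a rough illustration of what "sent to Solr in binary form" looks like on the wire, here is a Python sketch that builds (but does not send) a POST to Solr's extracting request handler, which hands the raw bytes to Tika. Several things here are assumptions and not taken from this thread: the base URL `http://localhost:8983/solr`, the handler being registered at `/update/extract`, the `attr_` prefix for unknown metadata fields, and the helper name `build_extract_request`. ManifoldCF's Solr connector does this work itself; the sketch is only useful for testing the Solr side by hand.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base: str, doc_id: str, data: bytes,
                          content_type: str = "application/octet-stream") -> Request:
    """Build a POST for Solr's extracting handler without sending it.

    literal.id sets the document's id field directly; uprefix asks Solr
    to prepend a prefix to extracted fields that the schema does not
    define, so they can be caught by a matching dynamic field.
    """
    params = urlencode({
        "literal.id": doc_id,
        "uprefix": "attr_",   # assumed prefix; needs a matching dynamic field
        "commit": "true",
    })
    url = f"{solr_base.rstrip('/')}/update/extract?{params}"
    return Request(url, data=data, method="POST",
                   headers={"Content-Type": content_type})


request = build_extract_request(
    "http://localhost:8983/solr",          # assumed Solr location
    "http://ridder.uio.no/",
    b"<html><body>test page</body></html>",
    content_type="text/html",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

If Solr answers a request like this with a 500, the problem is on the Solr side (handler or Tika setup) and not in the crawler.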

Thanks for clarifying. I can try to configure Solr to parse these documents.
Nutch did a good job except that it cannot detect whether a document was
modified in order to send an update/delete command to Solr. That function
is crucial for us.

I'm unsure about what you mean by anonymous fields in Solr. Can't I define
the fields I need in schema.xml as I want? I have created duplicate fields
for title and content in order to use different stemmers (I need to support
English and Norwegian). In Nutch there is a simple configuration file for
mapping fields from Nutch to Solr.

If this is an open site, I'll crawl it here myself momentarily and let
you know what I find.

Please do that. It's just my workstation with an Apache server running. It's
open.

BTW, I think I have set things up correctly for the crawler:
Seeds: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (with "include only hosts
matching seeds" checked)
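
The inclusion pattern above can be sanity-checked outside the crawler. This is a small Python sketch; ManifoldCF evaluates the pattern as a Java regular expression, but for a pattern this simple Python's `re` module behaves the same way. The test URLs and the helper name `is_included` are made up for the example. Note that the unescaped dots match any character, so an escaped variant is slightly stricter.

```python
import re

INCLUSION = re.compile(r"^http://ridder.uio.no/.*")
# Stricter variant with literal dots (a suggestion, not from the job setup).
INCLUSION_ESCAPED = re.compile(r"^http://ridder\.uio\.no/.*")


def is_included(url: str, pattern: "re.Pattern[str]" = INCLUSION) -> bool:
    """True when the pattern is found in the URL."""
    return pattern.search(url) is not None


for url in (
    "http://ridder.uio.no/",
    "http://ridder.uio.no/docs/page.html",
    "http://www.uio.no/",
    "http://ridderXuio.no/",   # hypothetical host, matches only the unescaped pattern
):
    print(url, is_included(url), is_included(url, INCLUSION_ESCAPED))
```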

I haven't filled out the "expiration interval (if continuous)" field under the
scheduling folder. Is this the reason why ManifoldCF is recrawling the page
every minute?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway

Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
