> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error

So what is happening is that either your Solr instance or your Solr
output connection is misconfigured, and when ManifoldCF tries to send
the document to Solr, Solr returns an error. I don't know what Solr's
"lazy loading error" is, but hopefully you can find out either from the
doc or from the Solr/Lucene newsgroup.

> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.

That's exactly what ManifoldCF is good at.

> I'm unsure about what you mean by anonymous fields in Solr. I cannot define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.

I'm probably using the wrong terminology. I think they are actually
called "dynamic fields".

> I haven't filled out the "expiration interval (if continuous)." under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the page
> every minute?

It's retrying because the Solr connector is getting that error and is
telling ManifoldCF that it should retry. The connector does that because
it hasn't figured out that the error is due to setup, rather than some
transient condition.

The expiration model for continuous crawling takes more space to
describe than I have here. I suggest you read about it in the online
end-user documentation. If that's not enough, there's a book on the way
from Manning Publications, called ManifoldCF in Action. Some chapters
that might help you should be available soon through the Manning Early
Access Program.

Thanks!
Karl

On Thu, Jan 20, 2011 at 9:50 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
> On 20.01.11 15.21, Karl Wright wrote:
>>
>> Hi Erlend,
>
> Hi Karl,
>
> Thank you for replying and for your comments. It's very appreciated.
>
>> (1) The best way to find out what ManifoldCF thinks it is doing is to
>> look at the Simple History report in the UI.
>
> It says:
>
> 01-20-2011 15:14:18.914 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 9 lazy loading error
> 01-20-2011 15:14:18.800 fetch http://ridder.uio.no/
> 200 588 103
> 01-20-2011 15:13:18.581 document ingest (solr_indexer)
> http://ridder.uio.no/
> 500 588 16 lazy loading error
> 01-20-2011 15:13:18.448 fetch http://ridder.uio.no/
> 200 588 111
>
>
>> (2) The Web Connector in ManifoldCF does not have the ability, at this
>> time, to extract links from Word docs, pdfs, etc., but Solr can
>> extract *content* from these documents if you configure it to use
>> Tika. The document is sent to Solr in binary form, and Tika extracts
>> whatever metadata it can find. ManifoldCF does not get involved in
>> that at all. Usually, setting up Solr with anonymous fields is the
>> way to go in this case.
>
> Thanks for clarifying. I can try to configure Solr to parse these documents.
> Nutch did a good job except that it cannot detect whether a document was
> modified in order to send an update/delete command to Solr. That function
> is crucial for us.
>
> I'm unsure about what you mean by anonymous fields in Solr. I cannot define
> the fields I need in schema.xml as I want? I have created duplicate fields
> for title and content in order to use different stemmers (I need to support
> English and Norwegian). In Nutch there is a simple configuration file for
> mapping fields from Nutch to Solr.
>
>> If this is an open site, I'll crawl it here myself momentarily and let
>> you know what I find.
>
> Please do that. It's just my workstation with an Apache server running. It's
> open.
>
> BTW, I think I have set things up correctly for the crawler:
> Seeds: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts
> matching seeds")
>
> I haven't filled out the "expiration interval (if continuous)." under the
> scheduling folder. Is this the reason why ManifoldCF is recrawling the page
> every minute?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
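
A side note on the Inclusions pattern quoted above: the dots in
^http://ridder.uio.no/.* are unescaped, so as a regular expression each
one matches any character, not only a literal dot. The sketch below
illustrates this with Python's re module. It assumes the connector's
regular-expression dialect treats "." the same way; the stricter
pattern, the helper name and the extra test URLs are hypothetical and
not taken from the job configuration.

```python
import re

# Pattern exactly as entered in the job's Inclusions field.
loose = re.compile(r"^http://ridder.uio.no/.*")

# Hypothetical stricter variant with the dots escaped (an assumption,
# not something the original job used).
strict = re.compile(r"^http://ridder\.uio\.no/.*")

urls = [
    "http://ridder.uio.no/",                 # the seed
    "http://ridder.uio.no/docs/report.pdf",  # hypothetical page on the same host
    "http://ridderxuio.no/",                 # hypothetical host, one character differs
    "https://ridder.uio.no/",                # same host, different scheme
]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches from the start."""
    return [url for url in candidates if pattern.match(url)]


loose_hits = matching(loose, urls)
strict_hits = matching(strict, urls)

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

With the "include only hosts matching seeds" box checked, the
difference probably does not matter for this job, but escaping the dots
keeps the pattern from accepting look-alike host names if that box is
ever unchecked.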