Hi Karl,

I managed to get around my out-of-memory issue with Solr by tweaking the Solr configuration.

Now I have documents that can take ages to be indexed by Solr. I set a reasonable value for the socket timeout of the Solr connector (1200 sec), but I still get timeouts. When a timeout occurs, the MCF crawl stops. If I restart it, the file that timed out gets indexed again, and so on.

What is your recommendation in such a situation?

Many thanks,

-----Original Message-----
From: Karl Wright [mailto:[email protected]]
Sent: Thursday, 22 October 2015 18:23
To: dev
Subject: Re: [Solr] Error on documents makes ManifoldCF

Hi Fred,

When a Java process runs out of memory in one thread, *all* threads are likely impacted. That's why, if you are seeing memory issues, you really just have to fix them; you can't just ignore the exception and hope for the best.

Karl

On Thu, Oct 22, 2015 at 12:20 PM, Frédéric Olier <[email protected]> wrote:

> Hi Karl,
>
> Indeed, I have this in my logs:
>
> MCF:
>
> Exception tossed: Repeated service interruptions - failure processing
> document: Read timed out
>
> Solr:
>
> Error for /datafari-solr/FileShare/update/extract
> java.lang.OutOfMemoryError: Java heap space
>
> The file is not that big (7M).
>
> Although ignoring the file might not be the 'nicest' solution, is that
> possible?
>
> I'll investigate on the Solr / Tika side to see if I can deactivate the
> recursive parsing of archive files.
>
> Thanks anyway,
> Fred.
>
> -----Original Message-----
> From: Karl Wright [mailto:[email protected]]
> Sent: Thursday, 22 October 2015 18:16
> To: dev
> Subject: Re: [Solr] Error on documents makes ManifoldCF
>
> Hi Fred,
>
> I suspect that you are getting an out-of-memory or out-of-disk error
> on the Solr side. That's really bad and you don't just want to make
> ManifoldCF ignore it.
>
> What you can do is limit the maximum size file sent to Solr. That's a
> far better fix.
>
> Karl
>
> On Thu, Oct 22, 2015 at 12:07 PM, Frédéric Olier <[email protected]> wrote:
>
> > Hi,
> >
> > I managed to progress on my issues.
> >
> > The document (docx) is now skipped as expected when it fails.
> >
> > However, I now have another issue.
> > I have a tar.gz file that itself contains 100+ tar.gz files.
> >
> > ManifoldCF gets a 500 error from Solr, which causes the crawl to abort.
> > I looked at the Solr configuration and, due to the hardware used, I
> > won't be able to tune the JVM and so on any further.
> >
> > Therefore I'd like to know whether ManifoldCF can be configured to
> > skip files for which it gets such an error instead of aborting?
> >
> > Fred.
> >
> > -----Original Message-----
> > From: Frédéric Olier [mailto:[email protected]]
> > Sent: Wednesday, 21 October 2015 17:51
> > To: [email protected]
> > Subject: RE: [Solr] Error on documents makes ManifoldCF
> >
> > Hi Karl,
> >
> > Many thanks.
> >
> > I found the configuration to use here:
> >
> > http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
> >
> > Search for "ignoreTikaException".
> >
> > I'll test it and see if it fixes my issue.
> >
> > Fred
> >
> > -----Original Message-----
> > From: Karl Wright [mailto:[email protected]]
> > Sent: Wednesday, 21 October 2015 17:23
> > To: dev
> > Subject: Re: [Solr] Error on documents makes ManifoldCF
> >
> > Standard google searching finds it.
> >
> > See:
> >
> > http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/% [email protected]%3E
> >
> > Karl
> >
> > On Wed, Oct 21, 2015 at 11:14 AM, Frédéric Olier <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your reply.
> > >
> > > I looked here:
> > > http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/
> > >
> > > But there is no 'search' option...
> > >
> > > Any idea where I can search for what I'm looking for more efficiently?
> > >
> > > Thanks
> > >
> > > -----Original Message-----
> > > From: Karl Wright [mailto:[email protected]]
> > > Sent: Wednesday, 21 October 2015 16:47
> > > To: dev
> > > Subject: Re: [Solr] Error on documents makes ManifoldCF
> > >
> > > Hi Frédéric,
> > >
> > > There's a flag in the Solr configuration you can set that will
> > > cause exceptions from Solr Cell (Tika) to cause the document to be
> > > skipped rather than causing ManifoldCF to retry the document. I
> > > don't remember what it is, but others have noted it and you can
> > > search the mail archive to find it.
> > >
> > > Thanks,
> > > Karl
> > >
> > > On Wed, Oct 21, 2015 at 10:29 AM, Frédéric Olier <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We integrated Solr with ManifoldCF.
> > > >
> > > > We configured Solr to use the OCR engine.
> > > >
> > > > When we crawl documents, MCF reads the docs fine and submits them
> > > > to Solr.
> > > >
> > > > On large files (PDFs, images), the OCR sometimes takes too
> > > > long, which causes the MCF request to fail.
> > > >
> > > > The annoying thing is that MCF does not ignore the file.
> > > >
> > > > On the next crawl, the file keeps failing.
> > > >
> > > > How can I tell ManifoldCF to skip a file that fails?
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Frédéric OLIER | Head of Strategic Planning
> > > > 33 442 016 891 / 33 662 635 031
> > > >
> > > > WOOXO
> > > > Tel: 0811 140 160
> > > > Fax: 0811 481 507
> > > > Immeuble Le Forum - Bât A - 3ème étage
> > > > 515 av. de la Tramontane
> > > > ZAC Athélia IV
> > > > 13600 LA CIOTAT
> > > > FRANCE
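
Karl's suggestion in the thread is to limit the maximum size of files sent to Solr. As a rough illustration of that idea only, here is a minimal Python sketch that separates documents into "send to the indexer" and "skip" by size. It is not ManifoldCF or Solr code: the function name split_by_size, the MAX_BYTES constant and its 5 MiB value are all hypothetical, and the thread does not state what limit was used.

```python
from typing import Iterable, List, Tuple

# Hypothetical cap (5 MiB); the thread does not give an actual value.
MAX_BYTES = 5 * 1024 * 1024


def split_by_size(
    docs: Iterable[Tuple[str, int]], max_bytes: int = MAX_BYTES
) -> Tuple[List[str], List[str]]:
    """Return (to_index, skipped) for an iterable of (name, size_in_bytes)."""
    to_index: List[str] = []
    skipped: List[str] = []
    for name, size in docs:
        # Documents above the cap are never handed to the indexer.
        if size <= max_bytes:
            to_index.append(name)
        else:
            skipped.append(name)
    return to_index, skipped
```

Note that, as Fred points out in the thread, a file that is "not that big (7M)" can still exhaust the heap when it is an archive of nested archives, so a size cap on its own may not catch every problematic document.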
