Hi Karl,

Indeed, I have this in my logs:

MCF: 

Exception tossed: Repeated service interruptions - failure processing document: 
Read timed out


Solr

Error for /datafari-solr/FileShare/update/extract
java.lang.OutOfMemoryError: Java heap space


The file is not that big (7 MB).

Although ignoring the file might not be the 'nicest' solution, is that
possible?

I'll investigate on the Solr/Tika side to see if I can deactivate the
recursive parsing of archive files.
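
To get a feel for why a 7 MB archive can exhaust the heap, one can estimate how much data a recursive extraction of nested tar.gz files has to handle. Below is a minimal sketch in Python (standard library only). The function name `expanded_size` and the `max_depth` limit are illustrative assumptions, not part of ManifoldCF, Solr, or Tika, and it only descends into members whose names look like tar archives.

```python
import io
import tarfile

# File-name suffixes treated as nested tar archives (an assumption for this sketch).
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar")


def expanded_size(fileobj, depth=0, max_depth=5):
    """Return the total size in bytes of the regular files in a tar archive.

    Nested tar archives are opened and their contents counted instead of
    the nested archive itself, up to max_depth levels of nesting.
    """
    total = 0
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, devices
            is_nested = member.name.lower().endswith(ARCHIVE_SUFFIXES)
            if is_nested and depth < max_depth:
                # Load the nested archive into memory so it can be reopened.
                data = archive.extractfile(member).read()
                total += expanded_size(io.BytesIO(data), depth + 1, max_depth)
            else:
                total += member.size
    return total
```

Calling it on a file opened in binary mode, for example `expanded_size(open("nested.tar.gz", "rb"))` with a hypothetical file name, gives the uncompressed total, which can be compared against the heap available to Solr.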

Thanks anyway,
Fred.



-----Original Message-----
From: Karl Wright [mailto:[email protected]]
Sent: Thursday, October 22, 2015 18:16
To: dev
Subject: Re: [Solr] Error on documents makes ManifoldCF

Hi Fred,

I suspect that you are getting an out-of-memory or out-of-disk error on the 
Solr side.  That's really bad and you don't just want to make ManifoldCF ignore 
it.

What you can do is limit the maximum size of the files sent to Solr.  That's
a far better fix.

Karl


On Thu, Oct 22, 2015 at 12:07 PM, Frédéric Olier <[email protected]> wrote:

> Hi,
>
> I managed to progress on my issues.
>
> The document (docx) is now skipped as expected when it fails.
>
> However, I now have another issue.
> I have a tar.gz file that itself contains 100+ tar.gz files.
>
> ManifoldCF gets a 500 error from Solr, which makes the crawl abort.
> I looked at the Solr configuration and, because of the hardware used,
> I won't be able to tune the JVM settings any further.
>
> Therefore I'd like to know whether ManifoldCF can be configured to
> skip files for which it gets such an error instead of aborting?
>
> Fred.
>
>
> -----Original Message-----
> From: Frédéric Olier [mailto:[email protected]]
> Sent: Wednesday, October 21, 2015 17:51
> To: [email protected]
> Subject: RE: [Solr] Error on documents makes ManifoldCF
>
> Hi Karl,
>
> Many thanks.
>
> I found the configuration to use here:
>
> http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
>
> Search for "ignoreTikaException".
>
> I'll test it and see if it fixes my issue.
>
> Fred
>
>
> -----Original Message-----
> From: Karl Wright [mailto:[email protected]]
> Sent: Wednesday, October 21, 2015 17:23
> To: dev
> Subject: Re: [Solr] Error on documents makes ManifoldCF
>
> A standard Google search finds it.
>
> See:
>
>
> http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/%
> [email protected]%3E
>
> Karl
>
>
> On Wed, Oct 21, 2015 at 11:14 AM, Frédéric Olier <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks for your reply.
> >
> > I looked here:
> > http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/
> >
> > But there is no 'search' option...
> >
> > Any idea where I can search for this more efficiently?
> >
> > Thanks
> >
> >
> > -----Original Message-----
> > From: Karl Wright [mailto:[email protected]]
> > Sent: Wednesday, October 21, 2015 16:47
> > To: dev
> > Subject: Re: [Solr] Error on documents makes ManifoldCF
> >
> > Hi Frédéric,
> >
> > There's a flag in the Solr configuration you can set that will cause 
> > exceptions from Solr Cell (Tika) to cause the document to be skipped 
> > rather than causing ManifoldCF to retry the document.  I don't
> > remember what it is, but others have noted it and you can search the
> > mail archive to find it.
> >
> > Thanks,
> > Karl
> >
> >
> > On Wed, Oct 21, 2015 at 10:29 AM, Frédéric Olier <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > We integrated Solr with ManifoldCF.
> > >
> > > We configured Solr to use the OCR engine.
> > >
> > > When we crawl documents, MCF reads the docs fine and submits them to
> > > Solr.
> > >
> > > On large files (PDFs, images), the OCR sometimes takes too long,
> > > which causes the MCF request to fail.
> > >
> > > The annoying thing is that MCF does not ignore the file.
> > >
> > > On the next crawl, the file keeps failing.
> > >
> > > How can I tell ManifoldCF to skip a file that fails?
> > >
> > > Thanks for your reply.
> > >
> > >
> > > Frédéric OLIER | Head of Strategic Planning
> > >
> > > 33 442 016 891 33 662 635 031
> > >
> > > WOOXO
> > > Tel: 0811 140 160
> > > Fax: 0811 481 507
> > > Immeuble Le Forum - Bât A - 3ème étage
> > > 515 av. de la Tramontane
> > > ZAC Athélia IV
> > > 13600 LA CIOTAT
> > > FRANCE
> > >
> > >
> > >
> > >
> > >
> >
>
