[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944498#comment-16944498
 ] 

Karl Wright commented on CONNECTORS-1625:
-----------------------------------------

[~DonaldVdD], MCF 2.12 uses Tika 1.19.1.  There is no magic in ManifoldCF; it 
simply calls the parse API for Tika.  If that is blowing up then it's Tika 
that's blowing up.

MCF uses disk storage in some cases for large documents when (for example) 
there is more than one output in a pipeline.  I would imagine that is not the 
problem here.  Otherwise it streams documents and does not hold them in 
memory at all, unless a badly-behaved connector is involved.  The only 
one we ship with this problem is the Solr Connector, which requires that the 
entire document fit into memory if you are not using Solr Cell.  That is why 
we insist that you set a document size limit when you operate the Solr 
Connector in this mode.
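
To illustrate the difference between streaming and buffering, here is a 
minimal sketch of a streaming copy that enforces a document size limit.  It is 
not ManifoldCF code: the class SizeLimitedCopy, the method copyWithLimit, and 
the maxBytes parameter are hypothetical names chosen for this example.  Only 
one fixed-size buffer is held in memory, regardless of document size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration only; not part of ManifoldCF or Tika.
class SizeLimitedCopy {

    // Copies 'in' to 'out' in fixed-size chunks, so memory use is bounded by
    // the buffer size rather than by the document size.  Throws IOException
    // as soon as more than maxBytes have been read.  Chunks copied before the
    // limit was hit will already have been written to 'out'.
    static long copyWithLimit(InputStream in, OutputStream out, long maxBytes)
            throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
            if (total > maxBytes) {
                throw new IOException(
                    "Document exceeds size limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, read);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A small in-memory document stands in for a real document stream.
        InputStream doc = new ByteArrayInputStream(new byte[100]);
        long copied = copyWithLimit(doc, new ByteArrayOutputStream(), 1000);
        System.out.println("Copied " + copied + " bytes");
    }
}
```

A connector that must hold the whole document in memory would instead reject 
oversized documents up front, which is the purpose of the size limit 
described above.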

I do recall that Tika v. 1.19.1 had a specific problem with memory usage for 
some kinds of documents; it would probably be worthwhile trying the current 
release to see if it has the same behavior.


> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
>                 Key: CONNECTORS-1625
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing attached file with manifoldcf 2.12, we keep getting an out of 
> memory error.
> When just parsing it through Tika 1.18, no issues are being found.
> Can anyone look into it?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
