[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944498#comment-16944498 ]
Karl Wright commented on CONNECTORS-1625:
-----------------------------------------

[~DonaldVdD], MCF 2.12 uses Tika 1.19.1. There is no magic in ManifoldCF; it simply calls Tika's parse API. If that is blowing up, then it is Tika that is blowing up.

MCF uses disk storage in some cases for large documents, for example when there is more than one output in a pipeline. I would imagine that is not the problem here. Otherwise it uses streaming and does not put documents into memory at all, unless a badly behaved connector is involved. The only one we ship with this problem is the Solr Connector, which requires that the entire document fit into memory if you are not using Solr Cell. That is why we insist that you set a document size limit when you operate the Solr Connector in this mode.

I do recall that Tika v. 1.19.1 had a specific problem with memory usage for some kinds of documents; it would probably be worthwhile trying the current release to see if it has the same behavior.

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
>                 Key: CONNECTORS-1625
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing attached file with manifoldcf 2.12, we keep getting an out of
> memory error.
> When just parsing it through Tika 1.18, no issues are being found.
> Can anyone look into it?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
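
The comment above distinguishes between streaming a document and holding the whole document in memory. That difference can be sketched roughly as follows. This is a minimal illustration in plain Java, not ManifoldCF, Tika, or Solr Connector code; the class name `StreamingSketch`, the methods `streamCopy` and `readFullyWithLimit`, and the 8 KiB buffer size are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustration only: not ManifoldCF or Tika code. All names are hypothetical.
class StreamingSketch {

    // Streaming: bytes pass through a fixed 8 KiB buffer, so memory use
    // does not grow with document size. Returns the number of bytes copied.
    static long streamCopy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Buffering: the whole document ends up in one byte array, so a size
    // limit is needed to keep a large document from exhausting the heap.
    static byte[] readFullyWithLimit(InputStream in, long maxBytes) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
            if (total > maxBytes) {
                // Reject before storing the chunk that crosses the limit.
                throw new IOException("Document exceeds limit of " + maxBytes + " bytes");
            }
            collected.write(buffer, 0, read);
        }
        return collected.toByteArray();
    }
}
```

With `streamCopy`, heap usage stays bounded by the buffer regardless of input size. With `readFullyWithLimit`, heap usage grows with the document, and only the limit keeps it bounded. This is the same reasoning behind setting a document size limit when a connector must hold the entire document in memory.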