[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814330#comment-16814330
 ] 

Donald Van den Driessche edited comment on CONNECTORS-1593 at 4/10/19 11:00 AM:


Hi [~kwri...@metacarta.com]
 The connector is a custom connector. It uses CloseableHttpClient with basic 
authentication.

When we created a project outside manifold to download, we had the same issues. 
About 8% of the downloaded docs had different hashes.
Only when putting about 3 seconds between the 2 downloads of the same file, we 
had 0% of different hashes.


was (Author: donaldvdd):
Hi [~kwri...@metacarta.com]
The connector is a custom connector. It uses CloseableHttpClient with basic 
authentication.

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-19 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796003#comment-16796003
 ] 

Karl Wright edited comment on CONNECTORS-1593 at 3/19/19 11:52 AM:
---

[~DonaldVdD], then from your description it sounds like the problem isn't with 
PDFBox.  It's because you have more worker threads allowed than you have memory 
available.  So if all of the worker threads wind up working on memory-expensive 
documents at the same time, they collide and you run out of memory.

How many worker threads did you allot?  How many Tika transformer connections 
to you have specified as the max?  The real number of simultaneous Tika 
extractions that take place will be the minimum of these two values.





was (Author: kwri...@metacarta.com):
[~DonaldVdD], then from your description it sounds like the problem isn't with 
PDFBox.  It's because you have more worker threads allowed than you have memory 
available.  So if all of the worker threads wind up working on memory-expensive 
documents at the same time, they collide and you run out of memory.

How many worker threads did you allot?



> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
>