[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401938#comment-15401938 ]

Egbert commented on TIKA-2045:
--

Thanks for investigating and reporting it with PDFBox. I'll subscribe to PDFBOX-3442 to keep track of a possible solution!

> TIKA crashes / runs out of memory on simple PDF
> ---
>
> Key: TIKA-2045
> URL: https://issues.apache.org/jira/browse/TIKA-2045
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.13
> Environment: Linux, Java 8
> Reporter: Egbert
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF
> that results in OutOfMemory errors while being processed by TIKA.
> It's a small, 1 page PDF file, so I don't think that it should consume that
> much memory.
> I verified the problem by using the GUI from the tika-app-1.13.jar file and
> that results in the same error on the same file. The file can be found at:
> http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf
> If I can help by providing any additional information, please let me know.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397504#comment-15397504 ]

Egbert edited comment on TIKA-2045 at 7/28/16 1:07 PM:
---

That's what I thought; however, the PDFBox FAQ says it can't just be ignored:
https://pdfbox.apache.org/1.8/faq.html

It says:
PDF documents have certain security permissions that can be applied to them and two passwords associated with them, a user password and a master password. If the “cannot extract text” permission bit is set then you need to decrypt the document with the master password in order to extract the text.

So that means that unless you provide a password, it will not extract text from this document. TIKA may be attempting to do something smart, but I really wouldn't know where to look for that.

was (Author: madegg):
That's what I thought, however, the PDFBox FAQ says it can't just be ignored:
https://pdfbox.apache.org/1.8/faq.html
[quote]PDF documents have certain security permissions that can be applied to them and two passwords associated with them, a user password and a master password. If the “cannot extract text” permission bit is set then you need to decrypt the document with the master password in order to extract the text.[/quote]
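For anyone wondering what the "permission bit" in that FAQ quote looks like in practice, here is a rough sketch. It assumes the layout given in the PDF specification for the /P entry of the encryption dictionary, where bit 5 (counting from 1 at the low-order end) governs copying or extracting text and graphics. The class and method names are made up for illustration and are not PDFBox or Tika API, and the two sample /P values are illustrative only, not read from the file in this issue.

```java
public class PdfPermissionBits {
    // Bit position (1-based, low-order first) for "copy or extract text and graphics".
    public static final int EXTRACT_BIT = 5;

    /** Returns true if the given 1-based bit is set in the /P integer. */
    public static boolean isBitSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** Returns true if the /P value grants text extraction. */
    public static boolean canExtractText(int p) {
        return isBitSet(p, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        int restrictive = -3904; // 0xFFFFF0C0: bits 3-6 and 9-12 cleared
        int permissive = -4;     // 0xFFFFFFFC: all permission bits set
        System.out.println("restrictive allows extraction: " + canExtractText(restrictive));
        System.out.println("permissive allows extraction:  " + canExtractText(permissive));
    }
}
```

A cleared bit only records the author's intent; whether a given tool honours it or ignores it is up to that tool, which is what the FAQ passage is about.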
[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397504#comment-15397504 ]

Egbert commented on TIKA-2045:
--

That's what I thought, however, the PDFBox FAQ says it can't just be ignored:
https://pdfbox.apache.org/1.8/faq.html
[quote]PDF documents have certain security permissions that can be applied to them and two passwords associated with them, a user password and a master password. If the “cannot extract text” permission bit is set then you need to decrypt the document with the master password in order to extract the text.[/quote]
[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397450#comment-15397450 ]

Egbert commented on TIKA-2045:
--

Ah, sorry. I must have missed that. I just tried with pdfbox-app-2.0.2.jar. ExtractText fails with:

> Exception in thread "main" java.io.IOException: You do not have permission to extract text

Which is perfectly acceptable as far as I'm concerned; it's giving this response right away rather than munching on it for several minutes and throwing an OOM error.
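Since the crawler use case is mostly about failing fast instead of "munching on it for several minutes", here is a minimal sketch of putting a time limit around a parse call, using only java.util.concurrent. The `BoundedTask` class, the `runWithTimeout` method and the stand-in tasks are hypothetical; in a real crawler the Callable would wrap whatever Tika call is in use. It bounds wall-clock time only: it does not cap memory, and a task that ignores interruption keeps running on its daemon thread, so a separate process would probably be sturdier isolation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTask {
    /** Runs the task on a worker thread; returns null if it does not finish in time. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-task");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; only helps if the task honours interruption
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a document that parses quickly.
        String fast = runWithTimeout(() -> "extracted text", 1000);
        // Stand-in for a document that keeps the parser busy for too long.
        String slow = runWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        System.out.println("fast: " + fast);
        System.out.println("slow: " + slow);
    }
}
```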
[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397425#comment-15397425 ]

Egbert commented on TIKA-2045:
--

Update: I just added -Xmx8G to the java command line for tika-app-1.13.jar and then it is able to generate some results. However, it is consuming 8 threads at 100% on my laptop while parsing the file, which seems like a little bit too much for a simple PDF, so I'm guessing it is somehow doing work it shouldn't be doing.
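As a small aside on -Xmx8G: when Tika runs embedded and the flag is set by some launcher script, it can be worth confirming which heap ceiling the JVM actually picked up. A sketch using only `Runtime` from the standard library; the `HeapReport` class and `toMiB` helper are illustrative names, not part of Tika.

```java
public class HeapReport {
    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx ceiling (or the JVM default when the flag is absent).
        System.out.println("max heap (MiB):       " + toMiB(rt.maxMemory()));
        System.out.println("committed heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free in heap (MiB):   " + toMiB(rt.freeMemory()));
        System.out.println("available processors: " + rt.availableProcessors());
    }
}
```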
[jira] [Created] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
Egbert created TIKA-2045:

Summary: TIKA crashes / runs out of memory on simple PDF
Key: TIKA-2045
URL: https://issues.apache.org/jira/browse/TIKA-2045
Project: Tika
Issue Type: Bug
Components: core
Affects Versions: 1.13
Environment: Linux, Java 8
Reporter: Egbert

We're using TIKA embedded in a webcrawler and today I've encountered a PDF that results in OutOfMemory errors while being processed by TIKA.
It's a small, 1 page PDF file, so I don't think that it should consume that much memory.

I verified the problem by using the GUI from the tika-app-1.13.jar file and that results in the same error on the same file. The file can be found at:
http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf

If I can help by providing any additional information, please let me know.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320580#comment-15320580 ]

Egbert commented on TIKA-1999:
--

I'm sorry, I don't really know what the effect of the limit would be. I am using Tika to extract plain text from PDF documents to be able to import them into a search index, so I do not have a lot of interest in the metadata. I'll try your suggested workaround to increase the stack size. Thanks!

> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> --
>
> Key: TIKA-1999
> URL: https://issues.apache.org/jira/browse/TIKA-1999
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Ubuntu 16.04 (64 bit)
> Oracle Java 1.8.0_91-b14 (64 bit)
> Reporter: Egbert
> Assignee: Tim Allison
>
> When trying to read the following PDF document:
> http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
> TIKA crashes for me with a java.lang.StackOverflowError, caused by a large
> number of recursion in:
> {noformat}
> at
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> {noformat}
> For some reason, the Tika App doesn't exhibit this behavior, but the
> following MWE exposes the issue for me:
> {noformat}
> import java.io.ByteArrayOutputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.ToHTMLContentHandler;
> public class test
> {
>     public static void main(String [] args) throws Exception {
>         String p = "/home/eggie/faulty_pdf_document.pdf";
>
>         FileInputStream input = new FileInputStream(new File(p));
>         AutoDetectParser tk = new AutoDetectParser();
>         ByteArrayOutputStream os = new ByteArrayOutputStream();
>         ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
>         ParseContext pc = new ParseContext();
>         System.out.println("Parsing");
>         tk.parse(input, handler, new Metadata(), pc);
>     }
> }
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
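On the stack-size workaround mentioned in the comment: besides a JVM-wide flag such as -Xss, a larger stack can be requested for a single thread through the four-argument `Thread` constructor. A minimal sketch follows, with a deliberately deep recursion standing in for the nested handler calls. The `BigStackRunner` class and its methods are hypothetical names, and the Javadoc for that constructor warns that the stack size is only a hint which some platforms ignore, so treat this as something to try rather than a guaranteed fix.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class BigStackRunner {
    /** Stand-in for deeply nested calls: recurses n levels and returns n. */
    public static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    /** Runs the job on a thread created with the requested stack size (in bytes). */
    public static <T> T callWithStack(Supplier<T> job, long stackBytes) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(null, () -> {
            try {
                result.set(job.get());
            } catch (Throwable t) {
                failure.set(t); // includes StackOverflowError if the stack is still too small
            }
        }, "big-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException("job failed", failure.get());
        }
        return result.get();
    }

    public static void main(String[] args) {
        // 64 MiB is an arbitrary illustrative value, not a recommendation from the issue.
        Integer reached = callWithStack(() -> depth(100_000), 64L * 1024 * 1024);
        System.out.println("recursion depth reached: " + reached);
    }
}
```

In the MWE from the issue, the call to `tk.parse(...)` would be the piece wrapped in the Supplier; that wiring is left out here because it needs Tika on the classpath.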
[jira] [Updated] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egbert updated TIKA-1999:
-
Description:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

TIKA crashes for me with a java.lang.StackOverflowError, caused by a large number of recursion in:
{noformat}
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
{noformat}

For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String [] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}
{noformat}

was:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

TIKA crashes for me with a java.lang.StackOverflowError, caused by a large number of recursion in:
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String [] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}