[ https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Egbert updated TIKA-1999:
-------------------------
    Description:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
Tika crashes for me with a java.lang.StackOverflowError, caused by a large number of recursive calls in:
{noformat}
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
{noformat}
For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String [] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}
{noformat}

  was:
When trying to read the following PDF document:
http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
Tika crashes for me with a java.lang.StackOverflowError, caused by a large number of recursive calls in:
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
For some reason, the Tika App doesn't exhibit this behavior, but the following MWE exposes the issue for me:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;
public class test
{
    public static void main(String [] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}

> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1999
>                 URL: https://issues.apache.org/jira/browse/TIKA-1999
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04 (64 bit)
>                      Oracle Java 1.8.0_91-b14 (64 bit)
>            Reporter: Egbert
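
The report shows only the repeating stack frame, not the body of getPrefix. As a rough illustration of how such a trace can arise, the sketch below assumes a lookup that resolves a namespace prefix by delegating to the enclosing element, so the stack depth grows with the nesting depth. This is an assumption made for illustration, not a confirmed diagnosis of TIKA-1999. The names PrefixLookupSketch, Element, getPrefixRecursive, getPrefixIterative and chain are hypothetical and are not Tika classes or methods. The iterative variant walks the same parent chain in a loop.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch; these names are illustrative and are not Tika classes.
final class PrefixLookupSketch {

    /** One open element; each element points at its enclosing parent. */
    static final class Element {
        final Element parent;
        final Map<String, String> namespaces; // namespace URI -> prefix

        Element(Element parent, Map<String, String> namespaces) {
            this.parent = parent;
            this.namespaces = namespaces;
        }

        /** Recursive lookup: stack depth grows with the number of ancestors visited. */
        String getPrefixRecursive(String uri) {
            String prefix = namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefixRecursive(uri) : "";
        }

        /** Iterative lookup: same result, constant stack depth. */
        String getPrefixIterative(String uri) {
            for (Element e = this; e != null; e = e.parent) {
                String prefix = e.namespaces.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            return "";
        }
    }

    /** Builds a root that declares uri -> prefix, then nests `depth` children under it. */
    static Element chain(int depth, String uri, String prefix) {
        Element current = new Element(null, Collections.singletonMap(uri, prefix));
        for (int i = 0; i < depth; i++) {
            current = new Element(current, Collections.<String, String>emptyMap());
        }
        return current; // innermost element
    }

    /** Returns true if the recursive lookup exhausts the calling thread's stack. */
    static boolean recursiveLookupOverflows(Element leaf, String uri) {
        try {
            leaf.getPrefixRecursive(uri);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String uri = "urn:example"; // made-up namespace URI
        Element shallow = chain(100, uri, "ex");
        System.out.println("shallow, recursive: " + shallow.getPrefixRecursive(uri));

        Element deep = chain(2_000_000, uri, "ex");
        System.out.println("deep, recursive overflows: " + recursiveLookupOverflows(deep, uri));
        System.out.println("deep, iterative: " + deep.getPrefixIterative(uri));
    }
}
```

If the real handler behaves like the recursive variant, the depth at which it fails depends on the JVM thread stack size (for example the -Xss setting), which may be one reason the same file behaves differently in different launch environments.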