[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-08-01 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401938#comment-15401938
 ] 

Egbert commented on TIKA-2045:
--

Thanks for investigating and reporting it with PDFBox. I'll subscribe to 
PDFBOX-3442 to keep track of a possible solution!

> TIKA crashes / runs out of memory on simple PDF
> ---
>
>              Key: TIKA-2045
>              URL: https://issues.apache.org/jira/browse/TIKA-2045
>          Project: Tika
>       Issue Type: Bug
>       Components: core
> Affects Versions: 1.13
>      Environment: Linux, Java 8
>         Reporter: Egbert
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF 
> that results in OutOfMemory errors while being processed by TIKA.
> It's a small, one-page PDF file, so I don't think it should consume that 
> much memory.
> I verified the problem by using the GUI from the tika-app-1.13.jar file and 
> that results in the same error on the same file. The file can be found at:
> http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf
> If I can help by providing any additional information, please let me know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397504#comment-15397504
 ] 

Egbert edited comment on TIKA-2045 at 7/28/16 1:07 PM:
---

That's what I thought; however, the PDFBox FAQ says it can't just be ignored:

https://pdfbox.apache.org/1.8/faq.html

It says:

PDF documents have certain security permissions that can be applied to them and 
two passwords associated with them, a user password and a master password. If 
the “cannot extract text” permission bit is set then you need to decrypt the 
document with the master password in order to extract the text.

So that means that unless you provide a password, it will not extract text from 
this document. TIKA may be attempting to do something smart, but I really 
wouldn't know where to look for that.
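
To make the "permission bit" in the FAQ text concrete, here is a minimal sketch of how such a flag can be read from the integer permission value (the /P entry of a PDF encryption dictionary). The bit positions follow the PDF specification's permission table; the class name, method names and sample values are my own illustration and are not taken from PDFBox, Tika or the file in this issue.

```java
final class PdfPermissionBits {
    // Bit positions are 1-based, as in the PDF specification's permission table.
    static final int PRINT_BIT = 3;
    static final int EXTRACT_BIT = 5;                    // copy or extract text and graphics
    static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10; // extraction for screen readers etc.

    /** Returns true if the given 1-based bit is set in the permission value. */
    static boolean isSet(int permissions, int bit) {
        return (permissions & (1 << (bit - 1))) != 0;
    }

    /** Returns true if the permission value allows text extraction. */
    static boolean canExtractText(int permissions) {
        return isSet(permissions, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        // Illustrative values only: -1 has every bit set, -3904 has bits 3 to 6 cleared.
        System.out.println(canExtractText(-1));
        System.out.println(canExtractText(-3904));
    }
}
```

A library that honours the flag would refuse extraction when `canExtractText` is false, which matches the behaviour the FAQ describes.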


was (Author: madegg):
That's what I thought; however, the PDFBox FAQ says it can't just be ignored:

https://pdfbox.apache.org/1.8/faq.html

[quote]PDF documents have certain security permissions that can be applied to 
them and two passwords associated with them, a user password and a master 
password. If the “cannot extract text” permission bit is set then you need to 
decrypt the document with the master password in order to extract the 
text.[/quote]






[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397504#comment-15397504
 ] 

Egbert commented on TIKA-2045:
--

That's what I thought; however, the PDFBox FAQ says it can't just be ignored:

https://pdfbox.apache.org/1.8/faq.html

[quote]PDF documents have certain security permissions that can be applied to 
them and two passwords associated with them, a user password and a master 
password. If the “cannot extract text” permission bit is set then you need to 
decrypt the document with the master password in order to extract the 
text.[/quote]






[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397450#comment-15397450
 ] 

Egbert commented on TIKA-2045:
--

Ah, sorry. I must have missed that. I just tried with pdfbox-app-2.0.2.jar. 
ExtractText fails with: 

> Exception in thread "main" java.io.IOException: You do not have permission to 
> extract text

That is perfectly acceptable as far as I'm concerned: it gives this response 
right away rather than churning on the file for several minutes and then 
throwing an OOM error.






[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397425#comment-15397425
 ] 

Egbert commented on TIKA-2045:
--

Update: I just added -Xmx8G to the java command line for tika-app-1.13.jar, and 
with that it is able to produce some results. However, it keeps 8 threads at 
100% CPU on my laptop while parsing the file, which seems excessive for a 
simple PDF, so I'm guessing it is somehow doing work it shouldn't be doing.
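
For a crawler, one way to keep a single runaway document from stalling everything is to run the parse on a worker thread with a wall-clock limit. The sketch below uses only java.util.concurrent; the class and method names are hypothetical, and the Callable stands in for whatever code calls the parser. It bounds time only, not heap: an OutOfMemoryError still affects the whole JVM, and cancellation only helps if the parsing code reacts to thread interruption, so running the parser in a separate process would be needed for real isolation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedTask {
    /** Runs the task on a worker thread; returns fallback if it exceeds timeoutMillis. */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-task");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; only effective if the task checks for interruption.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a crawler, the fallback could be an empty string so the document is indexed without text and flagged for later inspection.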






[jira] [Created] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Egbert (JIRA)
Egbert created TIKA-2045:


         Summary: TIKA crashes / runs out of memory on simple PDF
             Key: TIKA-2045
             URL: https://issues.apache.org/jira/browse/TIKA-2045
         Project: Tika
      Issue Type: Bug
      Components: core
Affects Versions: 1.13
     Environment: Linux, Java 8
        Reporter: Egbert


We're using TIKA embedded in a webcrawler and today I've encountered a PDF that 
results in OutOfMemory errors while being processed by TIKA.

It's a small, one-page PDF file, so I don't think it should consume that 
much memory.

I verified the problem by using the GUI from the tika-app-1.13.jar file and 
that results in the same error on the same file. The file can be found at:

http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf

If I can help by providing any additional information, please let me know.





[jira] [Commented] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

2016-06-08 Thread Egbert (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320580#comment-15320580
 ] 

Egbert commented on TIKA-1999:
--

I'm sorry, I don't really know what the effect of the limit would be. I am 
using Tika to extract plain text from PDF documents to be able to import them 
into a search index, so I do not have a lot of interest in the metadata.

I'll try your suggested workaround to increase the stack size. Thanks!
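
The exact workaround is not quoted in this message. As one possible way to get a larger stack without changing the JVM-wide setting, the sketch below runs deeply recursive work on a dedicated thread created with an explicit stack size. The class name, method names and the recursive function are hypothetical stand-ins for the parse call, and the JVM treats the requested stack size as a hint that some platforms may ignore.

```java
final class BigStackRunner {
    /** Deliberately non-tail-recursive so that every level consumes stack. */
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    /** Runs depth(n) on a thread that requests the given stack size in bytes. */
    static int depthOnBigStack(int n, long stackBytes) throws InterruptedException {
        int[] result = new int[1];
        Thread worker = new Thread(null, () -> result[0] = depth(n), "big-stack", stackBytes);
        worker.start();
        worker.join(); // join() makes the worker's write to result visible here
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // 256 MB is an arbitrary illustrative value, not a recommendation.
        System.out.println(depthOnBigStack(100_000, 256L * 1024 * 1024));
    }
}
```

In the MWE from the issue description, the call to tk.parse(...) would take the place of depth(n) inside the Runnable.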

> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> --
>
>              Key: TIKA-1999
>              URL: https://issues.apache.org/jira/browse/TIKA-1999
>          Project: Tika
>       Issue Type: Bug
>       Components: parser
> Affects Versions: 1.13
>      Environment: Ubuntu 16.04 (64 bit)
>                   Oracle Java 1.8.0_91-b14 (64 bit)
>         Reporter: Egbert
>         Assignee: Tim Allison
>
> When trying to read the following PDF document:
> http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf
> TIKA crashes for me with a java.lang.StackOverflowError, caused by a large 
> number of recursive calls in:
> {noformat}
> at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
> {noformat}
> For some reason, the Tika App doesn't exhibit this behavior, but the 
> following MWE exposes the issue for me:
> {noformat}
> import java.io.ByteArrayOutputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.ToHTMLContentHandler;
> public class test
> {
>     public static void main(String[] args) throws Exception {
>         String p = "/home/eggie/faulty_pdf_document.pdf";
> 
>         // Close the input stream even if parsing fails.
>         try (FileInputStream input = new FileInputStream(new File(p))) {
>             AutoDetectParser tk = new AutoDetectParser();
>             // Collect the generated HTML in memory.
>             ByteArrayOutputStream os = new ByteArrayOutputStream();
>             ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
>             ParseContext pc = new ParseContext();
>             System.out.println("Parsing");
>             // Parsing the linked PDF is where the reported error occurs.
>             tk.parse(input, handler, new Metadata(), pc);
>         }
>     }
> }
> {noformat}





[jira] [Updated] (TIKA-1999) org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

2016-06-07 Thread Egbert (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egbert updated TIKA-1999:
-
Description: 
When trying to read the following PDF document:

http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

TIKA crashes for me with a java.lang.StackOverflowError, caused by a large 
number of recursive calls in:

{noformat}
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
{noformat}

For some reason, the Tika App doesn't exhibit this behavior, but the following 
MWE exposes the issue for me:

{noformat}
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String[] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        // Close the input stream even if parsing fails.
        try (FileInputStream input = new FileInputStream(new File(p))) {
            AutoDetectParser tk = new AutoDetectParser();
            // Collect the generated HTML in memory.
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
            ParseContext pc = new ParseContext();
            System.out.println("Parsing");
            // Parsing the linked PDF is where the reported error occurs.
            tk.parse(input, handler, new Metadata(), pc);
        }
    }
}
{noformat}


  was:
When trying to read the following PDF document:

http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

TIKA crashes for me with a java.lang.StackOverflowError, caused by a large 
number of recursive calls in:

at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

For some reason, the Tika App doesn't exhibit this behavior, but the following 
MWE exposes the issue for me:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

public class test
{
    public static void main(String[] args) throws Exception {
        String p = "/home/eggie/faulty_pdf_document.pdf";

        FileInputStream input = new FileInputStream(new File(p));
        AutoDetectParser tk = new AutoDetectParser();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
        ParseContext pc = new ParseContext();
        System.out.println("Parsing");
        tk.parse(input, handler, new Metadata(), pc);
    }
}







