[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-10-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944574#comment-14944574
 ] 

Tilman Hausherr commented on TIKA-1737:
---

And I'd be interested to hear whether the situation described at the beginning 
has improved or not.

> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-10-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944295#comment-14944295
 ] 

Tim Allison commented on TIKA-1737:
---

[~alanbur], over on TIKA-1285, I posted a link for my github fork of Tika that 
includes a PDFBox 2.0 branch.  There's still some cleanup to do, but it should 
be a decent start.

If you have any interest in testing that out, any and all feedback would be 
welcomed.  Thank you!



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903422#comment-14903422
 ] 

Alan Burlison commented on TIKA-1737:
-

bq. Re the ArrayIndexOutOfBoundsException - are you using multithreading? I 
wonder if it is possibly related to PDFBOX-2824. That was fixed in the 2.0 
version only.

Yes, the app is MT and it does indeed look very much like PDFBOX-2824

bq. Re the NPE in PDFStreamEngine.java:355 - this is possibly solved in 1.8.11.

OK, thanks - PDFBOX-2987, correct?




[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903537#comment-14903537
 ] 

Tilman Hausherr commented on TIKA-1737:
---

No, PDFBOX-2987 is another one I fixed for you. The NPE in 
PDFStreamEngine.java:355 was (hopefully) fixed in PDFBOX-2935. To test this, 
you'd need to use a 1.8.11 snapshot version.



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902622#comment-14902622
 ] 

Tim Allison commented on TIKA-1737:
---

Could we have done something at the Tika level to cause this...I wonder?

Does the heap usage jump for every type of exception...that is, if I find any 
old PDF that triggers an exception, do you think I'll see this with Tika 1.10?


Out of curiosity, are you using Tika in the same jvm as Lucene?



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902659#comment-14902659
 ] 

Tim Allison commented on TIKA-1737:
---

bq. dating back as far as 1992

Yes, I just confirmed that I can't find any overlapping stack traces in our 
govdocs1 + Common Crawl corpus.  Thank you for sharing.



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902657#comment-14902657
 ] 

Alan Burlison commented on TIKA-1737:
-

bq. Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an 
exception. If you want to 'fix' (?) this at the Tika level I think you'd have 
to do something similar to what I'm doing and create a new PDFBox instance each 
time there's a PDFBox exception.

bq. Does the heap usage jump for every type of exception...that is, if I find 
any old PDF that triggers an exception, do you think I'll see this with Tika 
1.10?

Pretty much. I'm going to try to get a heap dump to work on but that means 
undoing all the workaround code I've added, so it will take a bit for me to do 
that.

bq. Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat 
instance that contains both Lucene indexer and search, where Tika is being used 
for text extraction for the Lucene indexer.




[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902528#comment-14902528
 ] 

Tim Allison commented on TIKA-1737:
---

bq.  there were many more that just had a single line of error

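Try adding this to your JVM invocation: 
{{-XX:-OmitStackTraceInFastThrow}}. The missing stack traces might be a Java 
optimization; HotSpot can drop the stack trace from built-in exceptions that 
are thrown repeatedly from the same compiled code.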


bq. the real issue is the horrendous memory leaks caused whenever a PDFBox 
exception is thrown, and that has definitely got worse

Have you done the profiling to determine the memory leaks are caused by 
exceptions being thrown?  That's interesting...



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902580#comment-14902580
 ] 

Alan Burlison commented on TIKA-1737:
-

The heap dump is huge and the profiler struggles to cope, so I haven't managed 
to do any detailed analysis yet. A pool of Tika parser threads handles the 
corpus; each thread is reused to extract text from multiple documents, and that 
text is then fed into Lucene. With Tika 1.10, every time a Tika instance sees 
an exception from PDFBox, the heap usage jumps up and doesn't recover, leading 
to OOM when the indexing is only a short way through. That doesn't happen with 
Tika 1.5. I've modified the indexer so that, rather than just logging the Tika 
exceptions, it destroys the relevant Tika instance, does a forced GC and then 
creates a new Tika instance. With Tika 1.10 that keeps the heap size within 
reasonable bounds. To me that seems like pretty conclusive proof that PDFBox is 
leaking when it throws exceptions.
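
The workaround described above could be sketched roughly as follows. This is an illustrative outline only, not the indexer's actual code: {{RecyclingExtractor}}, {{extract}} and {{rebuilds}} are hypothetical names, and a plain {{Function<String, String>}} stands in for the Tika instance so that the example has no external dependencies. Note that {{System.gc()}} is only a request to the JVM, not a guarantee.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical wrapper that throws away and rebuilds its extractor after a failure. */
final class RecyclingExtractor {
    private final Supplier<Function<String, String>> factory;
    private Function<String, String> extractor;
    private int rebuilds = 0;

    RecyclingExtractor(Supplier<Function<String, String>> factory) {
        this.factory = factory;
        this.extractor = factory.get();
    }

    /** Returns the extracted text, or null if extraction failed. */
    String extract(String document) {
        try {
            return extractor.apply(document);
        } catch (RuntimeException e) {
            // Drop the (possibly leaking) instance, ask for a GC, start afresh.
            extractor = null;
            System.gc();
            extractor = factory.get();
            rebuilds++;
            return null;
        }
    }

    /** Number of times the extractor has been replaced so far. */
    int rebuilds() {
        return rebuilds;
    }
}
```

The class is not thread-safe, so in a thread pool like the one described each worker thread would hold its own wrapper.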



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902522#comment-14902522
 ] 

Tim Allison commented on TIKA-1737:
---

Thank you, [~tilman]!



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902835#comment-14902835
 ] 

Tim Allison commented on TIKA-1737:
---

See PDFBOX-2986 for a resource leak discovered through testing against a file 
in Common Crawl that triggered a ttfparser exception that was close to some of 
yours.  I think this didn't affect you because your ttf exceptions are 
triggered within a PDFFile, and the MemoryTTFDataStream would have been used.

bq. It's actually a Tomcat instance that contains both Lucene indexer and 
search, where Tika is being used for text extraction for the Lucene indexer.

Ah, ok, that's right.  Apologies for the repetition with my soapbox in 
TIKA-1471...I realize this is the easiest way to build an app, but Tika can run 
into serious problems, and I'd strongly encourage trying to keep Tika out of 
the same JVM as Lucene if at all possible.  This is not to say we shouldn't fix 
Tika and its dependencies when problems are found!
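
One way to keep PDF parsing out of the indexing JVM is the route mentioned in the issue description, running pdftotext as an external process. The following is a minimal sketch under stated assumptions: {{ExternalPdfText}}, {{command}} and {{extract}} are hypothetical names, a {{pdftotext}} binary is assumed to be on the PATH, and the {{-enc UTF-8}} option is assumed to be supported by the installed build. It needs Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that extracts text by running pdftotext in a child process. */
final class ExternalPdfText {

    /** Builds the command line: pdftotext -enc UTF-8 <pdf> <text-file>. */
    static List<String> command(Path pdf, Path out) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), out.toString());
    }

    /** Runs pdftotext with a timeout and returns the extracted text. */
    static String extract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdftotext", ".txt");
        try {
            Process p = new ProcessBuilder(command(pdf, out))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // A hung or runaway parse only costs the child process, not the indexer.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException("pdftotext exited with status " + p.exitValue());
            }
            return Files.readString(out, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

Writing to a temporary file rather than reading the child's stdout keeps the timeout meaningful, because the parent never blocks on a pipe.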



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900508#comment-14900508
 ] 

Tim Allison commented on TIKA-1737:
---

Thank you for raising this issue.  I don't think we've seen this increase in 
our Common Crawl slice nor govdocs1...this is not to say that I doubt your 
findings!

If you have a chance, would you be able to confirm that you were getting good 
text out of the files before (a random handful would be enough)...sometimes a 
new exception is actually a good thing.

Also, if there is any way to share the triggering docs, that would help the 
PDFBox team, and we can test with PDFBox 2.0-trunk to see how that compares.  
If I were to update my dev Tika wrapper around PDFBox 2.0 on github, would you 
be willing/able to test it on these docs?

[~tilman], do any of these stacktraces look familiar?  




[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901502#comment-14901502
 ] 

Tilman Hausherr commented on TIKA-1737:
---

We will definitely not be able to find the cause of memory leaks without the 
files. You'll have to do that yourself, e.g. by running the PDFTextStripper 
with the current version and with an older version, profiling both, and then 
trying different revisions to find out when it started to be bad.
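
Before reaching for a full profiler, a crude first comparison between two versions is to measure how much heap stays in use after a failing parse. The sketch below is only an assumption about how one might do that: {{HeapProbe}}, {{usedHeapAfterGc}} and {{heapGrowth}} are hypothetical names, the {{Runnable}} stands in for a text extraction call, and because {{System.gc()}} is only a hint the figures are approximate.

```java
/** Hypothetical helper for rough before/after heap comparisons. */
final class HeapProbe {

    /** Used heap in bytes after requesting a GC (the request may be ignored). */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the task, swallowing runtime exceptions since a failing parse is
     * the case of interest, and returns the change in used heap in bytes.
     */
    static long heapGrowth(Runnable task) {
        long before = usedHeapAfterGc();
        try {
            task.run();
        } catch (RuntimeException e) {
            // Ignored on purpose: we want to see what is retained after a failure.
        }
        return usedHeapAfterGc() - before;
    }
}
```

Running the same set of files through this against each library version, and summing the growth, would show whether one version retains noticeably more memory after exceptions.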



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901042#comment-14901042
 ] 

Tilman Hausherr commented on TIKA-1737:
---

Some of the exceptions (the ClassCastExceptions in 
org.apache.pdfbox.util.operator) have an obvious cause that would be easy to 
prevent. For others I would need to get the PDF files, and I'm not sure that 
these can be fixed in the 1.8 version.

The best approach would be to create an issue in PDFBox for each class of 
errors, and then track whether the number of unchecked exceptions goes down.
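
To group failures into classes of errors and follow the unchecked count from one version to the next, a small tally on the indexing side might be enough. This is a sketch with hypothetical names ({{ExceptionTally}}, {{record}}, {{counts}}, {{uncheckedTotal}}); it groups by the class of the root cause, which is an assumption about what makes a useful grouping.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-run tally of exceptions, keyed by root-cause class name. */
final class ExceptionTally {
    private final Map<String, Integer> counts = new TreeMap<>();
    private int unchecked = 0;

    /** Records one failure under the class name of its root cause. */
    synchronized void record(Throwable t) {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        counts.merge(root.getClass().getName(), 1, Integer::sum);
        if (root instanceof RuntimeException) {
            unchecked++;
        }
    }

    /** Snapshot of the counts, sorted by class name. */
    synchronized Map<String, Integer> counts() {
        return new TreeMap<>(counts);
    }

    /** Number of recorded failures whose root cause is a RuntimeException. */
    synchronized int uncheckedTotal() {
        return unchecked;
    }
}
```

Comparing the output of {{counts()}} between two runs over the same corpus gives one line per candidate issue.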



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901371#comment-14901371
 ] 

Alan Burlison commented on TIKA-1737:
-

I'll redo the test and compare the outputs, but from memory the later PDFBox 
version was successfully indexing slightly more files, despite all the 
exceptions. Unfortunately I can't share the PDFs as they are confidential, but 
it's a set of around 5000 PDFs dating back as far as 1992, so some of them are 
pretty certain to be non-compliant and they are therefore a bit of a torture 
test. And yes, I'd be happy to test an updated version.

Note that the exceptions I attached to the bug are just the ones that had a 
useful stack trace. There were many more that just had a single line of error; 
presumably they are being caught within PDFBox itself.

I should point out that although the increase in exceptions is concerning, the 
real issue is the horrendous memory leaks caused whenever a PDFBox exception 
is thrown, and that has definitely got worse. It's not very helpful detecting 
more errors and throwing more exceptions if that just results in even more 
memory being leaked.
