Hi,

On 24.11.2014 at 14:57, Allison, Timothy B. wrote:
Andreas,

Sounds good.  If you could ping me on TIKA-1442, I'll be sure to hear the 
message in a timely fashion. :)
Done!

I just tried to build Tika with 1.8.8-SNAPSHOT, and I found a problem with the 
non-sequential parser on one of our test files 
(http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/test-documents/testPDF_protected.pdf).

This is the stacktrace with pdfbox-app-1.8.8-20141124.081221-143.jar's 
ExtractText -nonSeq:
I've added my changes and fixed the described issue as well.

BR
Andreas Lehmkühler

Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
ExtractText failed with the following exception:
java.io.IOException
         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
         at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225)
         at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:117)
         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:480)
         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:405)
         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:364)
         at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
         at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
         at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Caused by: java.util.zip.DataFormatException: incorrect header check
         at java.util.zip.Inflater.inflateBytes(Native Method)
         at java.util.zip.Inflater.inflate(Inflater.java:259)
         at java.util.zip.Inflater.inflate(Inflater.java:280)
         at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:128)
         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:101)
         ... 13 more
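
For readers unfamiliar with the "incorrect header check" message in the
trace above: java.util.zip.Inflater reports it when the first two bytes of
its input do not form a valid zlib header. The sketch below reproduces that
message with the JDK alone. It does not use PDFBox and makes no claim about
why the stream in testPDF_protected.pdf is unreadable; the class name
HeaderCheckDemo, its helper methods, and the sample bytes are all made up
for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of PDFBox or Tika.
public class HeaderCheckDemo {

    // Returns null if the data inflates cleanly, otherwise the
    // DataFormatException message reported by the Inflater.
    public static String inflateError(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // Stop if no progress is possible (truncated input or preset dictionary).
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
            }
            return null;
        } catch (DataFormatException e) {
            return e.getMessage();
        } finally {
            inflater.end();
        }
    }

    // Compresses a small input into a zlib stream (header, deflate data, checksum).
    public static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        byte[] buffer = new byte[plain.length + 64]; // ample for small inputs
        int n = deflater.deflate(buffer);
        deflater.end();
        return Arrays.copyOf(buffer, n);
    }

    public static void main(String[] args) {
        byte[] valid = deflate("hello".getBytes(StandardCharsets.US_ASCII));
        // Arbitrary bytes that do not start with a valid zlib header.
        byte[] garbage = {0x00, 0x01, 0x02, 0x03};

        System.out.println("valid stream:   " + inflateError(valid));
        System.out.println("garbage stream: " + inflateError(garbage));
    }
}
```

Any input whose leading bytes fail the zlib header checksum behaves like the
garbage array here, whatever the reason the bytes are wrong.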

-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Monday, November 24, 2014 8:39 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 1.8.8. release

Hi,

"Allison, Timothy B." <talli...@mitre.org> wrote on 24 November 2014 at 13:10:


Let me know when to hit "run"...
Thanks for the offer. There is just one thing related to PDFBOX-2430 that I'd
like to fix this evening...

BR
Andreas Lehmkühler


-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, November 23, 2014 12:27 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.8. release

Hi,

On 23.11.2014 at 17:55, Tilman Hausherr wrote:
Hi.

I'd prefer to wait for the tests of Tim Allison... unless you want to live
with the risk that he does the tests, and that we find a "big problem" within
that 3-day voting period...
Good point.

I haven't asked him to do these tests yet, because so much work was done on
both parsers.
I guess I'm done with parser changes at least in the 1.8 branch

Tilman

BR
Andreas Lehmkühler


On 23.11.2014 at 17:14, Andreas Lehmkuehler wrote:
Hi,

On 11.11.2014 at 12:15, Andreas Lehmkühler wrote:
Hi,

Andreas Lehmkühler <andr...@lehmi.de> wrote on 3 November 2014 at 11:52:


Hi,

there are again a number of solved issues and I'm thinking about a new
bugfix release. How about a new one next week, or maybe later if someone
wants to get some additional things done first?
Looks like I won't have the time this week to cut the release, sorry.
I'm not sure if I'll find some time while attending ApacheCon in Budapest
next week, but I should have some cycles in the last week of November.

This will buy us some time to fix some of the encryption/decryption issues.
I'm going to cut the release tomorrow evening, roughly 24 hours from now.
Any objections?


BR
Andreas Lehmkühler


