Hi,
On 24.11.2014 at 14:57, Allison, Timothy B. wrote:
Andreas,
Sounds good. If you could ping me on TIKA-1442, I'll be sure to hear the
message in a timely fashion. :)
Done!
I just tried to build Tika with 1.8.8-SNAPSHOT, and I found a problem with the
non-sequential parser on one of our test files
(http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/test-documents/testPDF_protected.pdf).
This is the stacktrace with pdfbox-app-1.8.8-20141124.081221-143.jar's
ExtractText -nonSeq:
I've added my changes and fixed the described issue as well.
BR
Andreas Lehmkühler
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Nov 24, 2014 8:48:06 AM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
ExtractText failed with the following exception:
java.io.IOException
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:117)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:480)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:405)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:364)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Caused by: java.util.zip.DataFormatException: incorrect header check
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:128)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:101)
... 13 more
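(Editorial note: the "incorrect header check" message above comes from java.util.zip.Inflater, which reports it when the first two bytes of its input are not a valid zlib header. The sketch below reproduces that message with only the JDK. The class name, method names, and the sample bytes are hypothetical, and it does not use PDFBox. One way a Flate stream can end up like this is if the bytes handed to the filter are still encrypted, but the thread does not confirm that this was the cause for testPDF_protected.pdf.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of PDFBox or Tika.
final class FlateHeaderCheckSketch {

    // Compress a small input with the default zlib wrapper.
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(plain);
            deflater.finish();
            byte[] buffer = new byte[plain.length + 64]; // ample for tiny inputs
            int length = deflater.deflate(buffer);
            return Arrays.copyOf(buffer, length);
        } finally {
            deflater.end();
        }
    }

    // Try to inflate the data and describe what happened.
    static String describe(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] out = new byte[1024];
            int produced = inflater.inflate(out);
            return "ok: " + produced + " bytes";
        } catch (DataFormatException e) {
            return "failed: " + e.getMessage();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] valid = deflate("hello pdfbox".getBytes(StandardCharsets.US_ASCII));
        // Arbitrary bytes standing in for a stream that is not valid zlib data.
        byte[] notZlib = {0x12, 0x34, 0x56, 0x78};

        System.out.println(describe(valid));
        System.out.println(describe(notZlib));
    }
}
```

The second call fails before any data is decompressed, which matches the trace above: the exception is raised on the first Inflater.inflate call inside FlateFilter.decompress.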
-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Monday, November 24, 2014 8:39 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 1.8.8. release
Hi,
"Allison, Timothy B." <talli...@mitre.org> wrote on 24 November 2014 at 13:10:
Let me know when to hit "run"...
Thanks for the offer, there is just one thing related to PDFBOX-2430 I'd like to
fix this evening...
BR
Andreas Lehmkühler
-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, November 23, 2014 12:27 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.8. release
Hi,
On 23.11.2014 at 17:55, Tilman Hausherr wrote:
Hi.
I'd prefer to wait for the tests of Tim Allison... unless you want to live with
the risk that he does the tests, and that we find a "big problem" within that
3-day voting period...
Good point.
I haven't asked him to do these tests yet, because so much work was done on
both parsers.
I guess I'm done with parser changes, at least in the 1.8 branch.
Tilman
BR
Andreas Lehmkühler
On 23.11.2014 at 17:14, Andreas Lehmkuehler wrote:
Hi,
On 11.11.2014 at 12:15, Andreas Lehmkühler wrote:
Hi,
Andreas Lehmkühler <andr...@lehmi.de> wrote on 3 November 2014 at 11:52:
Hi,
there are again a number of solved issues and I'm thinking about a new
bugfix release. How about a new one next week, maybe later if someone
wants to get some additional things done before?
Looks like I won't have the time this week to cut the release, sorry.
I'm not sure if I'll find some time when attending ApacheCon in Budapest next
week, but I should have some cycles in the last week of November.
This will buy us some time to fix some of the encryption/decryption issues.
I'm going to cut the release tomorrow in the evening, round about 24 hours
from now. Any objections?
BR
Andreas Lehmkühler