Tilman and Andreas, thank you for taking a look!

I agree there's no need to stop the release.  The improvements far outweigh the 
small regression.

> I had a look at content_diffs_with_exceptions.xlsx, then looking only 
> at govdocs there, all are similar or better.

Yes, agreed.  Do we care about these likely broken PDFs, from which 2.0.4 appears 
to extract more "common words" than 2.0.5 does?  

commoncrawl2_likely_broken/OV/OVWMJPQGCK2AQZYVWJWYUPTERPXOGIAD
commoncrawl2_likely_broken/R4/R4P75EJNMNXZC2DQYUFB6BSXQ2CWGVG7.pdf
commoncrawl2_likely_broken/BI/BIVJLJ4QULQQ4VHKKNMBUTKWXAMMN53N.pdf
commoncrawl2_likely_broken/LB/LB6LEZ75Y6OL7SGW7SV6JNO4G6FS7HAS
commoncrawl2_likely_broken/LQ/LQQFDYEI7XTOBMFPSL3IDVKRMUB6YIGU
commoncrawl2_likely_broken/OB/OBQTIKQW3MIEYJPGE4NR5WGPDUZC3ULY
commoncrawl2_likely_broken/BC/BCZSFNQAB62TUBURWG6B3ZOZCG5IH46P
commoncrawl2_likely_broken/TV/TVMANAJVH2VQVABYX6LCVO5KTERLFS2I.pdf

Out of 543,805 PDFs in our test set, and given that they're broken, I'm not 
overly concerned.
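
To put that in proportion, here is a quick sketch of the arithmetic.  The class 
and method names are made up for illustration; the two numbers are the eight 
files listed above and the test set size given in this message.

```java
import java.util.Locale;

public class RegressionShare {
    // Returns the affected files as a percentage of the whole corpus.
    static double percentOf(long affected, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * affected / total;
    }

    public static void main(String[] args) {
        long affected = 8;      // files listed above
        long total = 543_805;   // PDFs in the test set, per this message
        System.out.println(String.format(Locale.ROOT,
                "%.4f%% of the corpus", percentOf(affected, total)));
    }
}
```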

-----Original Message-----
From: Andreas Lehmkuehler [mailto:[email protected]] 
Sent: Wednesday, March 15, 2017 5:30 PM
To: [email protected]
Subject: Re: [VOTE] Release Apache PDFBox 2.0.5

On 15.03.2017 at 19:07, Tilman Hausherr wrote:
> Thanks Tim!
>
> I looked at newExceptionsInBDetails.xlsx (247 entries). IMHO no need 
> to stop the release, the number of entries in 
> fixedExceptionsInBDetails.xlsx (506) is larger, and the files with exceptions 
> are cut off.
I agree. However, I've checked one of the files, 015664.pdf, and it looks like a 
regression. I can open it using 2.0.4 but get the described exception with 
2.0.5 :-(

BR
Andreas

> I'll create an issue about these.
>
> I had a look at content_diffs_with_exceptions.xlsx, then looking only 
> at govdocs there, all are similar or better.
>
> Tilman
>
> On 15.03.2017 at 00:03, Allison, Timothy B. wrote:
>> +1
>>
>> I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k 
>> files from our regression corpus.
>>
>> I haven't had a chance to do much digging, but I wanted to share what 
>> I had as soon as I had it.
>>
>> Reports are here:
>> https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2.0.5-rc1.zip
>>
>>
>> Lots more "common words".  Many fewer exceptions.  There may be a 
>> regression that is causing 244 new exceptions, but on balance, the 
>> improvements are impressive.
>>
>>
>> java.io.IOException: Missing root object specification in trailer.
>>     at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2169)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:222)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922)
>>     at
>> ...
>>
>> -----Original Message-----
>> From: Timo Boehme [mailto:[email protected]]
>> Sent: Tuesday, March 14, 2017 9:11 AM
>> To: [email protected]
>> Subject: Re: [VOTE] Release Apache PDFBox 2.0.5
>>
>> Hi,
>>
>> +1
>>
>> Maybe we should add the
>>     -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true
>> setting (introduced with 2.0.4) to the Migration/Getting Started 
>> Web-Pages. I had to look through my emails in order to find it and it 
>> really makes a difference (at least on some systems) if there are a 
>> lot of images on a page - so far we only have the
>>     -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
>> setting documented (which did not help in my case). At least the user 
>> may try it out if rendering gets slow on some pages; it may not be a 
>> good general setting as it also may slow rendering down a bit on pages with 
>> few large images.
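
If anyone wants to try this setting from code instead of on the command line, 
here is a minimal sketch.  The property name is the one quoted above; the class 
and method names are hypothetical, and I'm assuming (not verified) that the 
property only needs to be set before rendering starts.

```java
public class CmykConversionSetting {
    // Property name as quoted in this thread; everything else here is illustrative.
    static final String PURE_JAVA_CMYK =
            "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";

    // Sets the property to "true" and returns the previous value (null if unset).
    static String enablePureJavaCmyk() {
        return System.setProperty(PURE_JAVA_CMYK, "true");
    }

    public static void main(String[] args) {
        enablePureJavaCmyk();
        // Assumption: any PDFBox loading and rendering code would run after this point.
        System.out.println(PURE_JAVA_CMYK + "=" + System.getProperty(PURE_JAVA_CMYK));
    }
}
```

This should be equivalent to passing the -D option at JVM startup, provided 
nothing has read the property before the call.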
>>
>>
>> Best,
>> Timo
>>
>>
>> On 13.03.2017 at 19:18, Andreas Lehmkuehler wrote:
>>> Hi,
>>>
>>> a candidate for the PDFBox 2.0.5 release is available at:
>>>
>>>      https://dist.apache.org/repos/dist/dev/pdfbox/2.0.5/
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>>      http://svn.apache.org/repos/asf/pdfbox/tags/2.0.5/
>>>
>>> The SHA1 checksum of the archive is
>>> 9521349be859498dfdd0e0f2a5d02b082f097ab1.
>>>
>>> Please vote on releasing this package as Apache PDFBox 2.0.5.
>>> The vote is open for the next 72 hours and passes if a majority of 
>>> at least three +1 PDFBox PMC votes are cast.
>>>
>>>      [ ] +1 Release this package as Apache PDFBox 2.0.5
>>>      [ ] -1 Do not release this package because...
>>>
>>>
>>> Here is my +1
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> --
>> Timo Boehme
>> OntoChem IT Solutions GmbH
>> Blücherstraße 24
>> 06120 Halle (Saale)
>> Germany
>>
>> phone: +49 345 478 047 4        | fax: +49 345 478 047 1
>> email: [email protected] | web: www.ontochem.com
>> HRB 21962 Amtsgericht Stendal   | USt-IdNr.: DE815563824
>> managing director : Lutz Weber
>>
>>
>
>

