[jira] [Comment Edited] (PDFBOX-3019) Optimize tolerance settings

Tilman Hausherr (JIRA) Wed, 14 Oct 2015 11:32:41 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957472#comment-14957472
 ]


Tilman Hausherr edited comment on PDFBOX-3019 at 10/14/15 6:31 PM:
-------------------------------------------------------------------

Here's a way to test for differences when building:
1) copy the pdf files into the directory {{PDFBox 
reactor\pdfbox\src\test\resources\input}}
2) in TestTextStripper.java, near the line {{if (!expectedFile.exists())}}, 
deactivate the fail() call so that the test doesn't stop at the first failure
3) run the test (it will fail)
4) copy the new txt files from {{PDFBox reactor\pdfbox\target\test-output}} to 
{{PDFBox reactor\pdfbox\src\test\resources\input}}
5) make your changes and run the test again to check for regressions
This can also be done across versions, i.e. generate the txt files with an old 
version, then run the test with the new one.
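
The comparison in steps 3 to 5 could also be done outside the test with a small standalone check. This is only a rough sketch, assuming the baseline and the newly generated txt files share the same file names; {{TextOutputDiff}} and {{findMismatches}} are hypothetical names and not part of PDFBox. It collects every mismatch instead of stopping at the first one.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of PDFBox: lists every baseline .txt file
// whose newly extracted counterpart is missing or has different content.
public class TextOutputDiff {

    // Returns the sorted names of all .txt files in expectedDir that are
    // missing from, or differ byte-for-byte in, actualDir.
    public static List<String> findMismatches(Path expectedDir, Path actualDir) throws IOException {
        List<String> mismatches = new ArrayList<>();
        try (DirectoryStream<Path> expected = Files.newDirectoryStream(expectedDir, "*.txt")) {
            for (Path expectedFile : expected) {
                String name = expectedFile.getFileName().toString();
                Path actualFile = actualDir.resolve(name);
                // Record the mismatch and keep going, so one run reports all differences.
                if (!Files.exists(actualFile)
                        || !Arrays.equals(Files.readAllBytes(expectedFile), Files.readAllBytes(actualFile))) {
                    mismatches.add(name);
                }
            }
        }
        Collections.sort(mismatches);
        return mismatches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: TextOutputDiff <baseline-dir> <new-output-dir>");
            return;
        }
        for (String name : findMismatches(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println("differs: " + name);
        }
    }
}
```

It would be called with the baseline directory and the new output directory as its two arguments. The comparison is byte-for-byte, so differing line endings also show up as a mismatch.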


> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>         Attachments: jbl-example-com.pdf
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.


