[
https://issues.apache.org/jira/browse/PDFBOX-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972642#comment-14972642
]
Ben McCann commented on PDFBOX-3044:
------------------------------------
[~tilman] yes, thank you! These were great changes.
Two other things I think we might be able to improve:
* Make the .txt files contain the optimal output instead of the current output.
Then have the test fail only if more than a certain number of extractions
differ. If a change makes extraction worse, the test should fail. If it makes
extraction better, we should lower the number that is allowed to differ, so
that future regressions are caught.
* Make the files roughly equivalent in length. cweb.pdf is 28 pages long and
all the rest are 1 page, so the test output is almost entirely dominated by
whether we make this file better or worse.
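A rough sketch of the first suggestion, in case it helps the discussion. This is not the existing test code: the class name, the method names and the sample strings are all made up for illustration, and plain strings stand in for whatever the real test reads from the .txt files and extracts from the PDFs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: compares extracted text against
// the optimal expected text and enforces a recorded baseline of differences.
class ExtractionThresholdCheck {

    /** Counts documents whose extracted text differs from the expected (optimal) text. */
    static int countMismatches(Map<String, String> expected, Map<String, String> actual) {
        int mismatches = 0;
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            String extracted = actual.get(entry.getKey());
            // A missing extraction (null) also counts as a mismatch.
            if (!entry.getValue().equals(extracted)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /** Throws if more documents differ than the recorded baseline allows. */
    static void assertWithinBaseline(int mismatches, int allowedMismatches) {
        if (mismatches > allowedMismatches) {
            throw new AssertionError(mismatches + " extractions differ, but only "
                    + allowedMismatches + " are allowed");
        }
    }

    public static void main(String[] args) {
        // Sample data is invented; only the file name cweb.pdf comes from the test set.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("cweb.pdf", "optimal text");
        expected.put("other.pdf", "more optimal text");

        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("cweb.pdf", "optimal text");
        actual.put("other.pdf", "slightly wrong text");

        int mismatches = countMismatches(expected, actual);
        assertWithinBaseline(mismatches, 1); // baseline: one known imperfect file
        System.out.println(mismatches + " of " + expected.size() + " extractions differ");
    }
}
```

Whenever an improvement brings the mismatch count down, the allowed number would be lowered by hand to match, so a later change cannot quietly undo it.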
> Improve text extraction tests
> -----------------------------
>
> Key: PDFBOX-3044
> URL: https://issues.apache.org/jira/browse/PDFBOX-3044
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.10, 1.8.11, 2.0.0
> Reporter: Ben McCann
> Assignee: Tilman Hausherr
>
> By [[email protected]]:
> The files in pdfbox/src/test/resources/input all seem to be UTF-16 encoded.
> I'm having a really difficult time using these files with the tools that I
> typically use (git, meld, etc.). Would it be possible to change the encoding
> to UTF-8?
> By @Tilman Hausherr
> I'm expanding this as a long term issue to improve the testing of text
> extraction. Todos:
> - don't fail immediately (this makes it easier to create output files the
> first time)
> - write a diff to the output directory
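The two todos above could be sketched roughly as follows, under some assumptions. The class and method names are hypothetical, and the comparison is a naive line-by-line positional one rather than a real diff algorithm. Instead of throwing on the first difference, it writes a UTF-8 .diff file to an output directory and returns whether anything differed, so the caller can finish all files before deciding to fail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ExtractionDiffWriter {

    /** Naive positional comparison: reports lines that differ at the same index. */
    static List<String> positionalDiff(List<String> expected, List<String> actual) {
        List<String> diff = new ArrayList<>();
        int max = Math.max(expected.size(), actual.size());
        for (int i = 0; i < max; i++) {
            String exp = i < expected.size() ? expected.get(i) : null;
            String act = i < actual.size() ? actual.get(i) : null;
            if (exp != null && exp.equals(act)) {
                continue; // identical line, nothing to report
            }
            if (exp != null) {
                diff.add("-" + (i + 1) + ": " + exp);
            }
            if (act != null) {
                diff.add("+" + (i + 1) + ": " + act);
            }
        }
        return diff;
    }

    /** Writes name.diff into outputDir if the texts differ; returns true when a diff was written. */
    static boolean writeDiffIfDifferent(String name, List<String> expected,
                                        List<String> actual, Path outputDir) throws IOException {
        List<String> diff = positionalDiff(expected, actual);
        if (diff.isEmpty()) {
            return false;
        }
        Files.createDirectories(outputDir);
        Files.write(outputDir.resolve(name + ".diff"), diff, StandardCharsets.UTF_8);
        return true;
    }
}
```

A test loop could call writeDiffIfDifferent for every input file, collect the names that returned true, and fail once at the end with the full list.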
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)