[
https://issues.apache.org/jira/browse/PDFBOX-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972672#comment-14972672
]
Tilman Hausherr commented on PDFBOX-3044:
-----------------------------------------
Thanks, I'll do the same for 1.8. However...
I don't believe in "soft" tests as described. If there is a change, someone
should consider whether it is an improvement, a flaw, or harmless. A
seemingly small change may mean that many real-world PDFs no longer extract
properly.
Btw the tests you know are not everything... there are also the tests by Tim
Allison, which are run on several hundred thousand documents. A new round of
tests has been done (see the posting on the dev list). I'm currently
investigating the changes; some may be really tricky.
I have no idea where the cweb test came from, or what it was meant to test; it
is probably from long ago. But I am trying to add only small, one-page tests
for any new PDFs I add.
> Improve text extraction tests
> -----------------------------
>
> Key: PDFBOX-3044
> URL: https://issues.apache.org/jira/browse/PDFBOX-3044
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.10, 1.8.11, 2.0.0
> Reporter: Ben McCann
> Assignee: Tilman Hausherr
>
> By [[email protected]]:
> The files in pdfbox/src/test/resources/input all seem to be UTF16 encoded.
> I'm having a really difficult time using these files with the tools that I
> typically use (git, meld, etc.) Would it be possible to change the encoding
> to UTF8?
> By @Tilman Hausherr
> I'm expanding this as a long term issue to improve the testing of text
> extraction. Todos:
> - don't fail immediately (this makes it easier to create output files the
> first time)
> - make a diff output in output dir
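
The two todos above could be sketched roughly as follows. This is only a
minimal illustration in plain Java, not the actual PDFBox test code: the class
SoftTextComparison, the methods diffLines and compareAndReport, and the
".diff" file naming are all hypothetical. The idea is that each comparison
records its differences in the output dir and returns a count, so the caller
can finish all files and then decide at the very end whether to fail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class SoftTextComparison {

    /** Compares two texts line by line and returns one entry per differing line. */
    public static List<String> diffLines(String expected, String actual) {
        String[] exp = expected.split("\\R", -1);
        String[] act = actual.split("\\R", -1);
        List<String> diffs = new ArrayList<>();
        int max = Math.max(exp.length, act.length);
        for (int i = 0; i < max; i++) {
            String e = i < exp.length ? exp[i] : "<missing>";
            String a = i < act.length ? act[i] : "<missing>";
            if (!e.equals(a)) {
                diffs.add("line " + (i + 1) + ": expected [" + e + "] but was [" + a + "]");
            }
        }
        return diffs;
    }

    /**
     * Writes any differences to outputDir/name.diff (as UTF-8) instead of
     * throwing, and returns the number of differing lines.
     */
    public static int compareAndReport(String name, String expected, String actual,
                                       Path outputDir) throws IOException {
        List<String> diffs = diffLines(expected, actual);
        if (!diffs.isEmpty()) {
            Files.createDirectories(outputDir);
            Files.write(outputDir.resolve(name + ".diff"), diffs, StandardCharsets.UTF_8);
        }
        return diffs.size();
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("text-extraction-diffs");
        int differences = 0;
        // Keep going after a mismatch; only the total is reported at the end.
        differences += compareAndReport("same", "Hello\nWorld", "Hello\nWorld", outputDir);
        differences += compareAndReport("changed", "Hello\nWorld", "Hello\nWord", outputDir);
        System.out.println(differences + " differing line(s); diffs written to " + outputDir);
    }
}
```

In a real test the final count would feed a single assertion after all files
have been processed, and each reported difference would still need to be
judged by a person as an improvement, a flaw, or harmless.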