[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

Tilman Hausherr (JIRA) Sat, 24 Oct 2015 09:24:00 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972672#comment-14972672
 ]


Tilman Hausherr commented on PDFBOX-3044:
-----------------------------------------

Thanks, I'll do the same for 1.8. However...

I don't believe in "soft" tests as described. If there is a change, someone 
should assess whether it is an improvement, a flaw, or harmless. A seemingly 
simple change may mean that many real-world PDFs no longer extract properly.

Btw, the tests you know about are not everything... there are also the tests 
by Tim Allison, which are run on several hundred thousand documents. A new 
round of tests has been done (see the posting on the dev list). I'm currently 
investigating the changes; some may be really tricky.

I have no idea where the cweb test came from or what it was meant to test; it 
is probably from long ago. But I am trying to add only small, one-page tests 
for any new PDFs I add.

> Improve text extraction tests
> -----------------------------
>
>                 Key: PDFBOX-3044
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3044
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>            Reporter: Ben McCann
>            Assignee: Tilman Hausherr
>
> By [[email protected]]:
> The files in pdfbox/src/test/resources/input all seem to be UTF16 encoded. 
> I'm having a really difficult time using these files with the tools that I 
> typically use (git, meld, etc.)  Would it be possible to change the encoding 
> to UTF8?
> By @Tilman Hausherr
> I'm expanding this as a long term issue to improve the testing of text 
> extraction. Todos:
> - don't fail immediately (this makes it easier to create output files the 
> first time)
> - make a diff output in output dir
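
The two todos above could be sketched roughly as follows. This is only an
illustration, not the actual PDFBox test code: the class name
ExtractionDiffSketch, the methods diffLines and writeDiff, and the ".diff" file
naming are all hypothetical. The idea is to collect every differing line
instead of stopping at the first one, write the report as UTF-8 into an output
directory, and let the caller decide when to fail.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of PDFBox.
class ExtractionDiffSketch {

    /** Compares line by line; returns one entry per differing line (empty if identical). */
    static List<String> diffLines(List<String> expected, List<String> actual) {
        List<String> diffs = new ArrayList<>();
        int max = Math.max(expected.size(), actual.size());
        for (int i = 0; i < max; i++) {
            // A null marks a line that exists on one side only.
            String exp = i < expected.size() ? expected.get(i) : null;
            String act = i < actual.size() ? actual.get(i) : null;
            if (!Objects.equals(exp, act)) {
                diffs.add("line " + (i + 1) + ": expected [" + exp + "] but was [" + act + "]");
            }
        }
        return diffs;
    }

    /** Writes outDir/name.diff as UTF-8 when there are mismatches; returns the mismatch count. */
    static int writeDiff(Path outDir, String name, List<String> expected, List<String> actual)
            throws IOException {
        List<String> diffs = diffLines(expected, actual);
        if (!diffs.isEmpty()) {
            Files.createDirectories(outDir);
            Files.write(outDir.resolve(name + ".diff"), diffs, StandardCharsets.UTF_8);
        }
        return diffs.size();
    }
}
```

A test could call writeDiff once per PDF, sum the returned counts, and fail a
single time at the end if the total is non-zero. That keeps the comparison
strict while still producing all output and diff files on the first run.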



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

