Here's the code. It assumes that all PDFs sit in one single flat
directory. Libraries needed: preflight-app, jai_imageio,
levigo_jbig2-imageio-1.6.1.jar. I have run it only against the trunk, not
against 1.8, because we didn't fix all of the problems there.
Tilman
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;
/**
 * Runs preflight on every PDF in one directory and logs unexpected exceptions.
 *
 * @author Tilman Hausherr
 */
public class PreflightTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        File dir;
        if (args.length > 0)
        {
            dir = new File(args[0]);
        }
        else
        {
            dir = new File("k:\\dc");
        }
        int total = 0;
        int failed = 0;
        File[] dirList = dir.listFiles(new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                // use this to start at a certain file
                if (name.compareTo("000000.pdf") <= 0)
                {
                    return false;
                }
                return name.toLowerCase().endsWith(".pdf");
            }
        });
        if (dirList == null)
        {
            // listFiles() returns null if dir is not a readable directory
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File pdf : dirList)
        {
            ++total;
            System.out.println(pdf.getName());
            // just test that it doesn't crash
            try
            {
                // remove the stack trace file left over from an earlier run
                new File(pdf.getName() + "-exception.txt").delete();
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                try (PreflightDocument preflightDocument =
                        parser.getPreflightDocument())
                {
                    preflightDocument.validate();
                    preflightDocument.getResult();
                }
                parser.clearResources();
            }
            catch (ValidationException e)
            {
                // "allowed" exception, not counted as a failure
            }
            catch (Throwable e)
            {
                ++failed;
                // save the stack trace in a file next to the console output
                try (PrintWriter pw = new PrintWriter(
                        new File(pdf.getName() + "-exception.txt")))
                {
                    e.printStackTrace(pw);
                }
                System.out.flush();
                System.err.flush();
                System.err.print(pdf.getName() + " preflight fail: ");
                e.printStackTrace();
                System.out.flush();
                System.err.flush();
            }
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: "
                    + (((float) failed) / total * 100.0) + "%");
        }
    }
}
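The "start at a certain file" filter in the listing above relies on plain lexicographic comparison of file names, so it only resumes correctly when the names are zero-padded to the same width (as in 000001.pdf). A minimal standalone sketch of that logic, using only the JDK; the class name ResumeFilterSketch, the helper names and the sample file names are made up for illustration and are not part of the original program:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ResumeFilterSketch
{
    // Hypothetical helper: true if 'name' is a PDF that sorts strictly
    // after 'lastDone' in lexicographic (String.compareTo) order.
    public static boolean accept(String name, String lastDone)
    {
        if (name.compareTo(lastDone) <= 0)
        {
            return false; // already handled in an earlier run
        }
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Hypothetical helper: applies accept() to a list of file names.
    public static List<String> remaining(List<String> names, String lastDone)
    {
        List<String> result = new ArrayList<>();
        for (String name : names)
        {
            if (accept(name, lastDone))
            {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // sample names, invented for this sketch
        List<String> names = Arrays.asList(
                "000000.pdf", "000001.pdf", "000002.PDF", "000003.txt");
        // prints [000001.pdf, 000002.PDF]
        System.out.println(remaining(names, "000000.pdf"));
    }
}
```

To resume after a crash, one would pass the name of the last file that completed as lastDone; the file with exactly that name is skipped as well, because the comparison uses <= 0.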
On 09.12.2014 at 17:28, Allison, Timothy B. wrote:
Tilman,
This is fantastic! If you send me an example of the code you used to call
preflight (#parse() or #parse(Format format)???), I'd like to run it within
tika-batch to see what our batch performance is.
Ideally, once we can turn our public vm on, it would be fun to run these
tests there.
Best,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests
Some numbers (the run took 4-5 days):
total: 231223, failed: 142, percentage failed: 0.06141257472336292
Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.
About the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors
The rest is mostly related to very broken PDF files.
Tilman
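The arithmetic behind these figures can be checked with a short sketch. The counts are the ones quoted in the message above; the class name FailureRateSketch and the helper percentFailed are made up for illustration. With the 33 OutOfMemoryErrors subtracted, the rate comes out at roughly 0.047%, and 49 failures remain outside the three named categories.

```java
import java.util.Locale;

public class FailureRateSketch
{
    // Hypothetical helper: failures as a percentage of all files tested.
    public static double percentFailed(int failed, int total)
    {
        return 100.0 * failed / total;
    }

    public static void main(String[] args)
    {
        int total = 231223;        // files tested
        int failed = 142;          // unexpected exceptions
        int outOfMemory = 33;      // OutOfMemoryErrors near the end of the run
        int isSymbol = 18;         // isSymbol stack overflows
        int fontMatrixNpe = 9;     // getFontMatrix NPEs
        int rootNotPages = 33;     // "root must be of type Pages" errors

        int withoutOom = failed - outOfMemory;
        int uncategorized = withoutOom - isSymbol - fontMatrixNpe - rootNotPages;

        System.out.printf(Locale.ROOT, "raw: %.4f%%%n",
                percentFailed(failed, total));
        System.out.printf(Locale.ROOT, "without OOM: %.4f%%%n",
                percentFailed(withoutOom, total));
        System.out.println("uncategorized failures: " + uncategorized);
    }
}
```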
On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
Hi Tilman,
that's very good news. I trust a lot of time went into reviewing the test
results. Without your and Tim's efforts this achievement wouldn't have been possible.
BR
Maruan
On 03.12.2014 at 21:04, Tilman Hausherr <thaush...@t-online.de> wrote:
I've now run preflight on half of the govdocs files. Every issue I have opened on
preflight is related to that test. The failure rate (exceptions other than the
"allowed" ValidationExceptions) is down from 1% when I started to 0.05% now.
Most of the frequent exceptions (e.g. the one with NonTerminalField) have been fixed.
What's left now are exceptions related to messy files, and some of the font-related issues.
Tilman
On 03.11.2014 at 22:58, Tilman Hausherr wrote:
On 03.11.2014 at 19:00, Tilman Hausherr wrote:
It is not looking good, there is at least one NPE issue coming.
No more NPEs after solving the two issues I opened today, except for PDFBOX-1743.pdf,
which is a known problem.
Coming up soon: run preflight on the 231227 PDF files from digitalcorpora to
see what happens.
Tilman