Here's the code. It assumes that all PDFs sit in one flat directory. Libraries needed: preflight-app, jai_imageio, levigo_jbig2-imageio-1.6.1.jar. I have only run it against the trunk, not against 1.8, because we didn't fix all the problems there.
Tilman

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

/**
 *
 * @author Tilman Hausherr
 */
public class PreflightTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        File dir;
        if (args.length > 0)
        {
            dir = new File(args[0]);
        }
        else
        {
            dir = new File("k:\\dc");
        }

        int total = 0;
        int failed = 0;
        File[] dirList = dir.listFiles(new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                if (name.compareTo("000000.pdf") <= 0) // use this to start at a certain file
                {
                    return false;
                }
                return name.toLowerCase().endsWith(".pdf");
            }
        });
        for (File pdf : dirList)
        {
            ++total;
            System.out.println(pdf.getName());
            // just test that it doesn't crash
            try
            {
                new File(pdf.getName() + "-exception.txt").delete();
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                {
                    preflightDocument.validate();
                    preflightDocument.getResult();
                }
                parser.clearResources();
            }
            catch (ValidationException e)
            {
                // validation failures are expected here; only crashes count as failures
            }
            catch (Throwable e)
            {
                ++failed;
                try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                {
                    e.printStackTrace(pw);
                }
                System.out.flush();
                System.err.flush();
                System.err.print(pdf.getName() + " preflight fail: ");
                e.printStackTrace();
                System.out.flush();
                System.err.flush();
            }
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
        }

    }

}


On 09.12.2014 at 17:28, Allison, Timothy B. wrote:
Tilman,
   This is fantastic!  If you send me an example of the code you used to call 
preflight (#parse() or  #parse(Format format)???), I'd like to run it within 
tika-batch to see what our batch performance is.
   Ideally, once we can turn our public vm on, it would be fun to run these 
tests there.
          Best,

                     Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests

Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.

about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors

The rest is mostly related to very broken PDF files.
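
A breakdown like the one above could be produced from the "-exception.txt" files that the test program writes into its working directory. The following is only a minimal sketch, not the method actually used for these numbers: the class name ExceptionTally and its methods are hypothetical, and it assumes that the first line of each file is the exception class name, optionally followed by ": message", which is what printStackTrace normally emits.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox or of the test program above.
public class ExceptionTally
{
    // Extracts the exception class name from the first line of a stack trace,
    // i.e. everything before the first ':' (or the whole line if there is none).
    public static String exceptionClass(String firstLine)
    {
        int colon = firstLine.indexOf(':');
        return (colon < 0 ? firstLine : firstLine.substring(0, colon)).trim();
    }

    // Counts how often each exception class occurs, sorted by class name.
    public static Map<String, Integer> tally(Iterable<String> firstLines)
    {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : firstLines)
        {
            counts.merge(exceptionClass(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException
    {
        // Assumption: the exception files are in the directory the test was started from.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> firstLines = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*-exception.txt"))
        {
            for (Path file : stream)
            {
                // The test program writes with the platform default charset, so read with it too.
                List<String> lines = Files.readAllLines(file, Charset.defaultCharset());
                if (!lines.isEmpty())
                {
                    firstLines.add(lines.get(0));
                }
            }
        }
        for (Map.Entry<String, Integer> entry : tally(firstLines).entrySet())
        {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Grouping by class name alone is coarse: distinguishing, say, the isSymbol stack overflow from other StackOverflowErrors would still mean looking at the stack frames in the individual files.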

Tilman


On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
Hi Tilman,

that's very good news. I trust a lot of time went into reviewing the test 
results. Without your and Tim's efforts, this achievement wouldn't have been possible.

BR

Maruan

On 03.12.2014 at 21:04, Tilman Hausherr <thaush...@t-online.de> wrote:

I've now run preflight on half of the govdocs files. Every issue I have opened on 
preflight is related to that test. The failure rate (exceptions other than the 
"allowed" ValidationExceptions) is down from 1% when I started to 0.05% now. 
Most of the frequent exceptions (e.g. the one with NonTerminalField) have been fixed. 
What's left now are exceptions related to messy files, and some of the font-related issues.

Tilman

On 03.11.2014 at 22:58, Tilman Hausherr wrote:
On 03.11.2014 at 19:00, Tilman Hausherr wrote:
It is not looking good, there is at least one NPE issue coming.
No more NPE after solving the two issues I opened today except PDFBOX-1743.pdf 
which is a known problem.

Coming up soon: run preflight on the 231227 PDF files from digitalcorpora to 
see what happens.

Tilman

