Tilman,
This is fantastic! If you send me an example of the code you used to call
preflight (#parse() or #parse(Format format)?), I'd like to run it within
tika-batch to see what our batch performance is.
Ideally, once we can turn our public VM on, it would be fun to run these
tests there.
Best,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Friday, December 05, 2014 2:45 PM
To: [email protected]
Subject: Re: preflight mass tests
Some numbers... it took 4-5 days
total: 231223, failed: 142, percentage failed: 0.06141257472336292
Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.
About the rest:
18 are the isSymbol stack overflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors
The rest is mostly related to very broken PDF files.
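
As a rough cross-check of the figures above, here is a minimal sketch of the arithmetic. It is not part of the test run itself; the class and method names are made up for illustration, and only the counts come from this message.

```java
public class FailureBreakdown {

    // Failure rate as a percentage of all files processed.
    static double failurePercent(int failed, int total) {
        return 100.0 * failed / total;
    }

    // Failures left over after subtracting the categories that are already explained.
    static int remaining(int failed, int... explained) {
        int rest = failed;
        for (int count : explained) {
            rest -= count;
        }
        return rest;
    }

    public static void main(String[] args) {
        int total = 231223;   // files processed
        int failed = 142;     // files that failed

        System.out.println("percentage failed: " + failurePercent(failed, total));

        // 33 OutOfMemoryErrors, 18 isSymbol, 9 getFontMatrix, 33 "root must be of type Pages"
        System.out.println("unexplained rest: " + remaining(failed, 33, 18, 9, 33));

        // Rate once the OutOfMemoryErrors (which passed in isolated runs) are set aside
        System.out.println("percentage without OOM: " + failurePercent(failed - 33, total));
    }
}
```

The "rest" printed here is the group described above as mostly very broken PDF files.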
Tilman
On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> That's very good news. I trust a lot of time went into reviewing the test
> results. Without your and Tim's efforts, this achievement wouldn't have been
> possible.
>
> BR
>
> Maruan
>
> On 03.12.2014 at 21:04, Tilman Hausherr <[email protected]> wrote:
>
>> I've now run preflight on half of the govdocs files. Every issue I have
>> opened on preflight is related to that test. The failure rate (exceptions
>> other than the "allowed" ValidationExceptions) is down from 1% when I
>> started to 0.05% now. Most of the frequent exceptions (e.g. the one with
>> NonTerminalField) have been fixed. What's left now are exceptions related to
>> messy files, and some of the font-related issues.
>>
>> Tilman
>>
>> On 03.11.2014 at 22:58, Tilman Hausherr wrote:
>>> On 03.11.2014 at 19:00, Tilman Hausherr wrote:
>>>> It is not looking good; there is at least one NPE issue coming.
>>> No more NPE after solving the two issues I opened today except
>>> PDFBOX-1743.pdf which is a known problem.
>>>
>>> Coming up soon: run preflight on the 231227 PDF files from digitalcorpora
>>> to see what happens.
>>>
>>> Tilman
>>>
>