Hi,

FYI, I am playing with CommonCrawl data for a talk that I plan to give in 2016. As part of this I built a small framework that lets me run the POI integration-test framework on a large number of documents that I extracted from a number of CommonCrawl runs. This is somewhat similar to what Tim is doing for Tika, but it focuses on POI-related documents.
I used this as a large regression check, in this case comparing release 3.13 against 3.14-beta1. In the future I can fairly easily run it against newer versions to check for any new regressions.

Some statistics:

* Overall I processed 829356 POI-related documents.
* 687506 documents processed fine in both versions!
* 140699 documents caused parsing errors in both versions. Many of these are actually invalid documents, wrong file types, incorrect mime-types, ..., so the actual error rate would be much lower, but it is currently not very useful to look at these errors without first sorting out all the false positives.
* 845 documents failed in POI 3.13 and now work in 3.14-beta1, so we made more documents succeed now, yay!
* And finally, 306 documents failed in POI 3.14-beta1 while they processed fine with POI 3.13. However, these potential regressions have the following causes:
** approx. 280 of these were caused because we do more checks for HSLF now
** 19 were OOMs that happen in my framework with large documents due to parallel processing
** One document fails date parsing where I don't see how it worked before; maybe this is also caused by more testing now
** 5 documents failed due to the new support for multi-part formats and locale id
** One document showed an NPE in HSLFTextParagraph

So only the last two look like actual regressions. I will commit fixes together with reproducing files for these two shortly.

I store the results in a database, so I can query them in various ways. E.g. attached is the list of the top 100 exception messages for the failed files.

Let me know if you would like to get a full stacktrace and document for any of those, or if you have suggestions for additional queries/checks that we could add here!

Dominik.
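
As a rough illustration of the bookkeeping behind the statistics above: each document ends up in one of four buckets depending on whether it parsed in the old and in the new version. The sketch below is not the actual framework code; the function names and bucket labels are made up for illustration. It also cross-checks that the counts reported in this mail add up.

```python
from collections import Counter

def classify(ok_old: bool, ok_new: bool) -> str:
    """Bucket one document by its result in the old and the new version.

    Hypothetical helper; labels are illustrative only.
    """
    if ok_old and ok_new:
        return "ok in both"
    if not ok_old and not ok_new:
        return "failed in both"
    if ok_new:
        return "fixed"          # failed in old, works in new
    return "regression"         # worked in old, fails in new

def summarize(results):
    """Count buckets for an iterable of (ok_old, ok_new) pairs."""
    return Counter(classify(ok_old, ok_new) for ok_old, ok_new in results)

# Counts as reported in this mail (3.13 vs. 3.14-beta1).
reported = {
    "ok in both": 687506,
    "failed in both": 140699,
    "fixed": 845,
    "regression": 306,
}

# Breakdown of the 306 potential regressions; 280 is an approximate figure.
regression_causes = {
    "more HSLF checks": 280,
    "OOM in the test framework": 19,
    "date parsing": 1,
    "multi-part formats and locale id": 5,
    "NPE in HSLFTextParagraph": 1,
}

# The buckets cover all processed documents, and the causes cover all regressions.
assert sum(reported.values()) == 829356
assert sum(regression_causes.values()) == reported["regression"]
```

With per-document results at hand, `summarize([(True, True), (True, False)])` would give one document in "ok in both" and one in "regression".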
