d) is not a problem. It was caused by a bit of idiocy in my random file selection code that allowed for duplicate files...so the list did have 500k file names, but only ~270k unique file names.
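For illustration only: the actual selection code is not shown in this thread, so the sketch below is an assumption about one common way duplicates can creep into a random file list, namely drawing with replacement. The helper names `sample_with_duplicates` and `sample_unique` are hypothetical. The second one de-duplicates the candidates and then samples without replacement, so the number of unique names always equals the requested count.

```python
import random


def sample_with_duplicates(paths, k, seed=None):
    """Hypothetical buggy version: each draw is independent,
    so the same path can be picked more than once."""
    rng = random.Random(seed)
    return [rng.choice(paths) for _ in range(k)]


def sample_unique(paths, k, seed=None):
    """Hypothetical fixed version: return k distinct paths."""
    # Drop duplicates in the input list; sort so a given seed is reproducible.
    unique = sorted(set(paths))
    if k > len(unique):
        raise ValueError(f"asked for {k} files, only {len(unique)} unique paths")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no path repeats.
    return rng.sample(unique, k)
```

A quick sanity check on any generated list is to compare `len(picked)` with `len(set(picked))` before starting a long run.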

On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:

> All,
>   I finished the regression tests, and the reports are available here:
> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>
> My takeaways:
> a) we need to fix the new code in the PDFParser that sets whether or
> not there is a digital signature. That should be set, not add.
> b) we are getting a few new exceptions on going over the safety maximum
> for byte array allocation in POI. We can make this configurable at the
> Tika level.
> c) there are a few new problems with EMF parsing, but these won't harm
> parsing the rest of the file.
> d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
> there were ~500k in the list...I need to figure out what went wrong.
>
> If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>
> Cheers,
>
> Tim
>
> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <[email protected]> wrote:
>
>> All,
>>   I started the regression tests on a random set of 500k files. I found
>> this morning that it was _still_ going. It turns out I had accidentally
>> configured extract images for PDFs, which adds to the processing time and
>> leads to more OOMs.
>>   I restarted the regression tests this morning with that feature turned
>> off.
>>
>> Best,
>>
>> Tim
>>
