Hi everyone,

I recently updated my project to the latest Tika trunk (since it has things I need) and quickly noticed that my performance tests show a rather big regression. I tracked it down to the new code introduced in this changeset: http://www.mail-archive.com/[email protected]/msg00081.html
Parsing all those SimpleDateFormat pattern strings takes a *long* time. getTimeZone and setTimeZone also show up in the profile. In my testing scenario (lots of simple files, multithreading that makes reuse of Metadata objects hard, etc.), Metadata.<init> takes about 1/3 of all Tika time, rivaling guessContent and the actual parsing in the profiler.

From a quick glance, it seems like all DateFormat creation could be static, or otherwise done up front. Is this correct?

Thanks,
Radek
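
For illustration, here is a rough sketch of the "create up front" idea. It is not the actual Tika code: the class name CachedDateFormats, the parse helper, and the three patterns are all made up for the example. Since SimpleDateFormat is not safe to share between threads, the sketch keeps one set of formats per thread in a ThreadLocal instead of a plain static field, so each pattern is compiled once per thread rather than once per Metadata instance.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration only; not part of Tika.
final class CachedDateFormats {

    // Illustrative patterns (assumption); the real list lives in Metadata.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd",
    };

    // Looked up once instead of on every Metadata construction.
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // SimpleDateFormat is not thread-safe, so each thread gets its own set,
    // built lazily the first time that thread asks for it.
    private static final ThreadLocal<DateFormat[]> FORMATS =
        ThreadLocal.withInitial(CachedDateFormats::build);

    private CachedDateFormats() {
    }

    private static DateFormat[] build() {
        DateFormat[] formats = new DateFormat[PATTERNS.length];
        for (int i = 0; i < PATTERNS.length; i++) {
            SimpleDateFormat format = new SimpleDateFormat(PATTERNS[i], Locale.ROOT);
            format.setTimeZone(UTC);
            format.setLenient(false);
            formats[i] = format;
        }
        return formats;
    }

    // Tries each cached format in order; returns null if none matches.
    static Date parse(String text) {
        for (DateFormat format : FORMATS.get()) {
            try {
                return format.parse(text);
            } catch (ParseException ignored) {
                // Fall through to the next pattern.
            }
        }
        return null;
    }
}
```

With something like this, constructing a Metadata object would no longer need to parse any patterns or touch TimeZone at all. Whether a per-thread cache fits Tika's threading model is an open question on my side.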
