Luca Della Toffola created TIKA-1149:
----------------------------------------
Summary: 12% performance improvement by caching in CompositeParser
Key: TIKA-1149
URL: https://issues.apache.org/jira/browse/TIKA-1149
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.4, 1.3
Reporter: Luca Della Toffola
Priority: Minor
We found an easy way to improve Tika's performance. The idea is to avoid
recomputing the parsers map on every call to
CompositeParser.getParsers(...) when the context is empty, and to cache the
returned value instead.
This can be done safely even if the media-registry and the list of component
parsers change while Tika is executing, by invalidating the cache in that
case.
Our attached patch computes the parsers map once per instance of
CompositeParser.
The patch uses the cache only when the context is empty, and invalidates it
in the corresponding setters whenever the media-registry or the list of
component parsers changes.
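
The caching scheme described above can be sketched roughly as follows. This
is a simplified illustration, not the attached patch and not Tika's actual
code: the class CachingCompositeSketch, its Component type, the alias map
standing in for the media-registry, the map of overrides standing in for the
parse context, and the rebuilds counter are all hypothetical names chosen for
the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a composite parser; NOT Tika's real class. */
class CachingCompositeSketch {

    /** Hypothetical component parser: a name plus the types it handles. */
    static final class Component {
        final String name;
        final List<String> types;

        Component(String name, List<String> types) {
            this.name = name;
            this.types = types;
        }
    }

    private List<Component> components;
    // Assumption: an alias -> canonical type map plays the role of the registry.
    private Map<String, String> aliases;
    // null means "not computed yet" or "invalidated by a setter".
    private Map<String, Component> cache;
    // Counts how often the map is recomputed; for demonstration only.
    int rebuilds = 0;

    CachingCompositeSketch(List<Component> components, Map<String, String> aliases) {
        this.components = new ArrayList<>(components);
        this.aliases = new HashMap<>(aliases);
    }

    synchronized void setComponents(List<Component> components) {
        this.components = new ArrayList<>(components);
        cache = null; // the component list changed, so the cached map is stale
    }

    synchronized void setAliases(Map<String, String> aliases) {
        this.aliases = new HashMap<>(aliases);
        cache = null; // the registry changed, so the cached map is stale
    }

    /** Assumption: the context is modelled as a map of per-call overrides. */
    synchronized Map<String, Component> getParsers(Map<String, Component> context) {
        if (!context.isEmpty()) {
            return build(context); // non-empty context: always recompute
        }
        if (cache == null) {
            cache = Collections.unmodifiableMap(build(context));
        }
        return cache; // empty context: reuse the map computed earlier
    }

    private Map<String, Component> build(Map<String, Component> context) {
        rebuilds++;
        Map<String, Component> map = new HashMap<>();
        for (Component c : components) {
            for (String type : c.types) {
                // Normalize each type through the registry stand-in.
                map.put(aliases.getOrDefault(type, type), c);
            }
        }
        map.putAll(context);
        return map;
    }
}
```

In this sketch the cached map is returned as an unmodifiable view, so callers
cannot corrupt the shared instance; whether the real patch does the same is
not stated here.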
For example, when running Tika 1.3 on a set of large (~50k classes) JAR files
(i.e., Java class library + Tika app + other apps), the patch reduces the
running time from 32 seconds to 29 seconds -- i.e., a speedup of ~12%.
Speedups of the same order of magnitude are also found for smaller workloads.