[
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738130#comment-13738130
]
Luca Della Toffola commented on TIKA-1149:
------------------------------------------
I ran a quick test with the new patch. By letting {{CompositeParser}} inherit
from {{SimpleParser}} and commenting out the current
{{CompositeParser.getSupportedTypes(ParseContext)}} method, I obtain a ~5%
speedup. I used the same workload as before and ran Tika with {{-d --text}}.
Obviously, not all test cases pass in my case either.
> Improve parser lookup performance
> ---------------------------------
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3, 1.4
> Reporter: Luca Della Toffola
> Priority: Minor
> Labels: performance
> Attachments: 0001-TIKA-1149-Improve-parser-lookup-performance.patch,
> CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid
> recomputing the parsers map over and over
> in CompositeParser.getParsers(...) when the context is empty, and to cache the
> returned value instead.
> This is safe even if the media registry and the list of component parsers
> change while Tika is executing, because the cache is invalidated in that case.
> Our attached patch computes the parsers map once per instance of
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the
> cache in the corresponding setters whenever the media registry or the list of
> component parsers changes.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files
> (i.e., Java class library + Tika app + other apps), the patch reduces the
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the
> same order of magnitude are also observed for smaller workloads.
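
The caching idea in the description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: {{ComponentParser}}, {{CachingComposite}}, the {{rebuilds}} counter, and the use of a plain {{Map}} as the "context" are all hypothetical stand-ins and do not correspond to Tika's actual classes or signatures. A setter for the media registry is omitted; under the same assumptions it would invalidate the cache in the same way as {{setParsers}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a component parser; not a Tika class.
final class ComponentParser {
    final String name;
    final List<String> supportedTypes;

    ComponentParser(String name, List<String> supportedTypes) {
        this.name = name;
        this.supportedTypes = supportedTypes;
    }
}

// Hypothetical composite that caches its media type -> parser map.
final class CachingComposite {
    private List<ComponentParser> parsers;
    private Map<String, ComponentParser> cachedMap; // null means "not computed yet"
    int rebuilds = 0; // counts recomputations, for illustration only

    CachingComposite(List<ComponentParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    // Changing the component parsers invalidates the cached map.
    synchronized void setParsers(List<ComponentParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        this.cachedMap = null;
    }

    // The "context" is modelled as a plain map; only the empty case is cached.
    synchronized Map<String, ComponentParser> getParsers(Map<String, Object> context) {
        if (!context.isEmpty()) {
            // A non-empty context may influence the result, so recompute.
            return buildMap();
        }
        if (cachedMap == null) {
            cachedMap = Collections.unmodifiableMap(buildMap());
        }
        return cachedMap;
    }

    private Map<String, ComponentParser> buildMap() {
        rebuilds++;
        Map<String, ComponentParser> map = new HashMap<>();
        for (ComponentParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.put(type, parser); // later parsers override earlier ones
            }
        }
        return map;
    }
}
```

In this sketch, repeated lookups with an empty context build the map only once per instance, and a call to the setter forces the next lookup to rebuild it. The methods are synchronized only to keep the example simple; whether that is the right trade-off for a real implementation is a separate question.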