Reuse tagsoup HtmlSchema instance across HtmlParsers (performance improvement)
------------------------------------------------------------------------------
Key: TIKA-528
URL: https://issues.apache.org/jira/browse/TIKA-528
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Bruno Dumon
While parsing a set of small HTML files (email messages), I noticed with a
profiler that about a third of the time was spent constructing instances of
tagsoup's HTMLSchema class.
Since this is (or appears to me to be) simply a data structure that is only
read after construction, it should be thread safe and can be reused.
Fortunately this can be done easily, as shown in the patch I will attach to
this issue.
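
To illustrate the idea only (this is not the attached patch), here is a minimal
sketch of sharing one read-only schema across many parser instances. The
Schema and Parser classes below are hypothetical stand-ins, not tagsoup's or
Tika's classes, and the construction counter exists purely to make the effect
visible. It assumes the schema is never modified after it has been built.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedSchemaDemo {

    /** Hypothetical stand-in for a schema that is costly to build and read-only afterwards. */
    static final class Schema {
        // Counts constructor calls so the demo can show how often a schema is built.
        static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

        private final Map<String, String> elementParents;

        Schema() {
            CONSTRUCTIONS.incrementAndGet();
            Map<String, String> m = new HashMap<>();
            // A real HTML schema would register far more entries; three are enough here.
            m.put("p", "body");
            m.put("li", "ul");
            m.put("td", "tr");
            // Wrapping the map makes the "read-only after construction" assumption explicit.
            elementParents = Collections.unmodifiableMap(m);
        }

        String parentOf(String element) {
            return elementParents.get(element);
        }
    }

    /** Hypothetical parser that reuses one schema instead of building its own. */
    static final class Parser {
        // Built once during class initialization and shared by every Parser instance.
        private static final Schema SHARED_SCHEMA = new Schema();

        String parentOf(String element) {
            return SHARED_SCHEMA.parentOf(element);
        }
    }

    public static void main(String[] args) {
        // Simulate parsing many small documents, each with a fresh parser.
        for (int i = 0; i < 1000; i++) {
            new Parser().parentOf("li");
        }
        System.out.println("schemas constructed: " + Schema.CONSTRUCTIONS.get());
    }
}
```

Holding the instance in a static final field relies on Java's class
initialization guarantees for safe publication, so no extra locking is needed
as long as nothing mutates the schema afterwards.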