Nimarukan created ODFTOOLKIT-424:
------------------------------------
Summary: PERFORMANCE: Reuse DOMRDFaParser, currently created too
frequently during insertNode and setTextContent.
Key: ODFTOOLKIT-424
URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-424
Project: ODF Toolkit
Issue Type: Improvement
Components: odfdom
Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77,
MSWin7
Reporter: Nimarukan
Priority: Minor
https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
This spreadsheet filling and copying test runs more slowly than desired.
One problem that contributes to the slow performance is that
DOMRDFaParser is created too frequently, potentially for every cell,
during insertNode and setTextContent.
Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and
XMLEventFactory, can reduce the run time to 36-40% of the current runtime on
0.8.11 snapshot.
h4. DIAGNOSIS
Profiling the test (in NetBeans) reveals the filesystem is being accessed
proportional to the number of cells.
A culprit appears to be that
org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node)
is called by
- org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode
()
-
org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode
()
-
org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent
(String)
On every call updateInContentMetaDataCache calls
- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
which in turn calls
-- javax.xml.stream.XMLEventFactory.newInstance()
-- javax.xml.stream.XMLOutputFactory.newInstance()
-- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory,
XMLEVentFactory, URIExtractor)
which in turn calls
--- org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory,
XMLEVentFactory, URIExtractor)
which in turn calls
---- net.rootdev.javardfa.Parser(StatementSink)
which in turn AGAIN calls
----- javax.xml.stream.XMLEventFactory.newInstance()
----- javax.xml.stream.XMLOutputFactory.newInstance()
The newInstance methods are expensive because they access the file system
looking for service providers. Repeating such expensive calls for every cell
is one reason this test case is slow.
h4. POSSIBLE FIXES:
h5. A. One approach is to address three opportunities to improve:
- 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and
URIExtractor with its superclass javardfa.Parser, so that they are not
duplicated within each parser.
- 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like
URIExtractor, so they do not have to be recreated on every call that creates a
SAXRDFaParser or DOMRDFaParser.
- 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do
not have to be recreated on every call that needs an RDFa parser.
h5. B. Another approach is to do 1 and 2 without 3.
This will eliminate repeatedly constructing the expensive factories for each
cell, but leaves the repeated construction of the parsers.
h5. C. A simpler approach is to only reuse DOMRDFaParser:
This will greatly reduce the frequency of creating the factories (including
the duplicate ones). (SAXRDFaParser is only called by OdfFileDom in the
initialize method, so it is not important to reuse.)
h4. POSSIBLE CRITERIA:
h5. Reducing computation:
Rough figures from brief runs (not repeated runs):
B reduces time to about 40% of the original.
C reduces time to about 37% of the original.
A reduces time to about 36% of the original.
'A' saves more computation than 'B' or 'C'.
h5. Minimizing change:
The RDFa parsers are all code that overrides library superclass
net.rootdev.javardfa.Parser with some modifications. The superclass was not
designed for the modifications, so there is a large duplication of code. To
preserve the similarities between the superclass code and the copied subclass
code, it may be best to minimize modifications to copied code in the subclass
parsers.
Approach 'C' meets this concern better than approach 'A' or 'B'.
h5. Safety of reusing parsers:
Code inspection reveals fields:
- sink: The primary result of the parser is to emitTriples, writing to the
JenaSink that is already reused.
- settings: The settings are currently not modified. (In fact
RDFaParser.settings is not modifiable: the superclass enable/disable methods do
not write the subclass settings, so the settings in use would not be changed
even if they were called.)
- extractor: is already reused.
- locator: is initialized by setBase, which is called before each parse in
OdfFileDom.
- context: also initialized by setBase, which is called before each parse in
OdfFileDom.
- literalCollector: contains a stack of states grows on start tags and
shrinks on end tags. If an exception is thrown for one parse, then it could be
in an invalid state for the next parse.
So reusing a parser looks safe as long as a patch resets the literalCollector
when a reusable parser throws an exception. Since LiteralCollector is a
library class with private fields, its internal state is not accessible to be
reset. So the easiest approach may be to assume that parse exceptions occur
rarely: simply forget the broken parser when an exception occurs, and recreate
it the next time the parser is needed.
Approach 'B' is safer than 'A' or 'C' because the parsers are still created
for every cell.
----
(If these criteria are all important, I suggest approach 'C' unless problems
are later uncovered, then back off to approach 'B' if they cannot be fixed.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)