[ https://issues.apache.org/jira/browse/ODFTOOLKIT-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nimarukan updated ODFTOOLKIT-424: --------------------------------- Attachment: Issue-424-patches.zip Refreshed patch 3b (sorry, missed in earlier version). > PERFORMANCE: Reuse DOMRDFaParser, currently created too frequently during > insertNode and setTextContent. > -------------------------------------------------------------------------------------------------------- > > Key: ODFTOOLKIT-424 > URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-424 > Project: ODF Toolkit > Issue Type: Improvement > Components: odfdom > Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77, > MSWin7 > Reporter: Nimarukan > Priority: Minor > Labels: performance > Attachments: Issue-424-patches.zip > > > https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip > This spreadsheet filling and copying test runs more slowly than desired. > One problem that contributes to the slow performance is that > DOMRDFaParser is created too frequently, potentially for every cell, > during insertNode and setTextContent. > Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and > XMLEventFactory, can reduce the run time to 36-40% of the current runtime on > 0.8.11 snapshot. > h4. DIAGNOSIS > Profiling the test (in NetBeans) reveals the filesystem is being accessed > proportional to the number of cells. > A culprit appears to be that > org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node) > is called by > - > org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode > () > - > org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode > () > - > org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent > (String) > On every call updateInContentMetaDataCache calls > - org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink) > which in turn calls > -- javax.xml.stream.XMLEventFactory.newInstance() > -- javax.xml.stream.XMLOutputFactory.newInstance() > -- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory, > XMLEVentFactory, URIExtractor) > which in turn calls > --- org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory, > XMLEVentFactory, URIExtractor) > which in turn calls > ---- net.rootdev.javardfa.Parser(StatementSink) > which in turn AGAIN calls > ----- javax.xml.stream.XMLEventFactory.newInstance() > ----- javax.xml.stream.XMLOutputFactory.newInstance() > The newInstance methods are expensive because they access the file system > looking for service providers. Repeating such expensive calls for every cell > is one reason this test case is slow. > h4. POSSIBLE FIXES: > h5. A. One approach is to address three opportunities to improve: > - 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and > URIExtractor with its superclass javardfa.Parser, so that they are not > duplicated within each parser. > - 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like > URIExtractor, so they do not have to be recreated on every call that creates > a SAXRDFaParser or DOMRDFaParser. > - 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do > not have to be recreated on every call that needs an RDFa parser. > h5. B. Another approach is to do 1 and 2 without 3. > This will eliminate repeatedly constructing the expensive factories for > each cell, but leaves the repeated construction of the parsers. > h5. C. A simpler approach is to only reuse DOMRDFaParser: > This will greatly reduce the frequency of creating the factories (including > the duplicate ones). (SAXRDFaParser is only called by OdfFileDom in the > initialize method, so it is not important to reuse.) > h4. POSSIBLE CRITERIA: > h5. Reducing computation: > Rough figures from brief runs (not repeated runs): > B reduces time to about 40% of the original. > C reduces time to about 37% of the original. > A reduces time to about 36% of the original. > 'A' saves more computation than 'B' or 'C'. > h5. Minimizing change: > The RDFa parsers are all code that overrides library superclass > net.rootdev.javardfa.Parser with some modifications. The superclass was not > designed for the modifications, so there is a large duplication of code. To > preserve the similarities between the superclass code and the copied subclass > code, it may be best to minimize modifications to copied code in the subclass > parsers. > Approach 'C' meets this concern better than approach 'A' or 'B'. > h5. Safety of reusing parsers: > Code inspection reveals fields: > - sink: The primary result of the parser is to emitTriples, writing to the > JenaSink that is already reused. > - settings: The settings are currently not modified. (In fact > RDFaParser.settings is not modifiable: the superclass enable/disable methods > do not write the subclass settings, so the settings in use would not be > changed even if they were called.) > - extractor: is already reused. > - locator: is initialized by setBase, which is called before each parse in > OdfFileDom. > - context: also initialized by setBase, which is called before each parse > in OdfFileDom. > - literalCollector: contains a stack of states grows on start tags and > shrinks on end tags. If an exception is thrown for one parse, then it could > be in an invalid state for the next parse. > So reusing a parser looks safe as long as a patch resets the > literalCollector when a reusable parser throws an exception. Since > LiteralCollector is a library class with private fields, its internal state > is not accessible to be reset. So the easiest approach may be to assume that > parse exceptions occur rarely: simply forget the broken parser when an > exception occurs, and recreate it the next time the parser is needed. > Approach 'B' is safer than 'A' or 'C' because the parsers are still created > for every cell. > ---- > (If these criteria are all important, I suggest approach 'C' unless problems > are later uncovered, then back off to approach 'B' if they cannot be fixed.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)