[
https://issues.apache.org/jira/browse/ODFTOOLKIT-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nimarukan updated ODFTOOLKIT-424:
---------------------------------
Attachment: Issue-424-patches.zip
Refreshed patch 3b (sorry, missed in earlier version).
> PERFORMANCE: Reuse DOMRDFaParser, currently created too frequently during
> insertNode and setTextContent.
> --------------------------------------------------------------------------------------------------------
>
> Key: ODFTOOLKIT-424
> URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-424
> Project: ODF Toolkit
> Issue Type: Improvement
> Components: odfdom
> Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77,
> MSWin7
> Reporter: Nimarukan
> Priority: Minor
> Labels: performance
> Attachments: Issue-424-patches.zip
>
>
> https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
> This spreadsheet filling and copying test runs more slowly than desired.
> One problem that contributes to the slow performance is that
> DOMRDFaParser is created too frequently, potentially for every cell,
> during insertNode and setTextContent.
> Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and
> XMLEventFactory, can reduce the run time to 36-40% of the current runtime on
> 0.8.11 snapshot.
> h4. DIAGNOSIS
> Profiling the test (in NetBeans) reveals the filesystem is being accessed
> proportional to the number of cells.
> A culprit appears to be that
> org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node)
> is called by
> -
> org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode
> ()
> -
> org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode
> ()
> -
> org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent
> (String)
> On every call updateInContentMetaDataCache calls
> - org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
> which in turn calls
> -- javax.xml.stream.XMLEventFactory.newInstance()
> -- javax.xml.stream.XMLOutputFactory.newInstance()
> -- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory,
> XMLEVentFactory, URIExtractor)
> which in turn calls
> --- org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory,
> XMLEVentFactory, URIExtractor)
> which in turn calls
> ---- net.rootdev.javardfa.Parser(StatementSink)
> which in turn AGAIN calls
> ----- javax.xml.stream.XMLEventFactory.newInstance()
> ----- javax.xml.stream.XMLOutputFactory.newInstance()
> The newInstance methods are expensive because they access the file system
> looking for service providers. Repeating such expensive calls for every cell
> is one reason this test case is slow.
> h4. POSSIBLE FIXES:
> h5. A. One approach is to address three opportunities to improve:
> - 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and
> URIExtractor with its superclass javardfa.Parser, so that they are not
> duplicated within each parser.
> - 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like
> URIExtractor, so they do not have to be recreated on every call that creates
> a SAXRDFaParser or DOMRDFaParser.
> - 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do
> not have to be recreated on every call that needs an RDFa parser.
> h5. B. Another approach is to do 1 and 2 without 3.
> This will eliminate repeatedly constructing the expensive factories for
> each cell, but leaves the repeated construction of the parsers.
> h5. C. A simpler approach is to only reuse DOMRDFaParser:
> This will greatly reduce the frequency of creating the factories (including
> the duplicate ones). (SAXRDFaParser is only called by OdfFileDom in the
> initialize method, so it is not important to reuse.)
> h4. POSSIBLE CRITERIA:
> h5. Reducing computation:
> Rough figures from brief runs (not repeated runs):
> B reduces time to about 40% of the original.
> C reduces time to about 37% of the original.
> A reduces time to about 36% of the original.
> 'A' saves more computation than 'B' or 'C'.
> h5. Minimizing change:
> The RDFa parsers are all code that overrides library superclass
> net.rootdev.javardfa.Parser with some modifications. The superclass was not
> designed for the modifications, so there is a large duplication of code. To
> preserve the similarities between the superclass code and the copied subclass
> code, it may be best to minimize modifications to copied code in the subclass
> parsers.
> Approach 'C' meets this concern better than approach 'A' or 'B'.
> h5. Safety of reusing parsers:
> Code inspection reveals fields:
> - sink: The primary result of the parser is to emitTriples, writing to the
> JenaSink that is already reused.
> - settings: The settings are currently not modified. (In fact
> RDFaParser.settings is not modifiable: the superclass enable/disable methods
> do not write the subclass settings, so the settings in use would not be
> changed even if they were called.)
> - extractor: is already reused.
> - locator: is initialized by setBase, which is called before each parse in
> OdfFileDom.
> - context: also initialized by setBase, which is called before each parse
> in OdfFileDom.
> - literalCollector: contains a stack of states grows on start tags and
> shrinks on end tags. If an exception is thrown for one parse, then it could
> be in an invalid state for the next parse.
> So reusing a parser looks safe as long as a patch resets the
> literalCollector when a reusable parser throws an exception. Since
> LiteralCollector is a library class with private fields, its internal state
> is not accessible to be reset. So the easiest approach may be to assume that
> parse exceptions occur rarely: simply forget the broken parser when an
> exception occurs, and recreate it the next time the parser is needed.
> Approach 'B' is safer than 'A' or 'C' because the parsers are still created
> for every cell.
> ----
> (If these criteria are all important, I suggest approach 'C' unless problems
> are later uncovered, then back off to approach 'B' if they cannot be fixed.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)