[jira] [Created] (ODFTOOLKIT-424) PERFORMANCE: Reuse DOMRDFaParser, currently created too frequently during insertNode and setTextContent.

Nimarukan (JIRA) Wed, 06 Apr 2016 16:12:36 -0700

Nimarukan created ODFTOOLKIT-424:
------------------------------------

             Summary: PERFORMANCE: Reuse DOMRDFaParser, currently created too 
frequently during insertNode and setTextContent.
                 Key: ODFTOOLKIT-424
                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-424
             Project: ODF Toolkit
          Issue Type: Improvement
          Components: odfdom
         Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77, 
MSWin7
            Reporter: Nimarukan
            Priority: Minor



https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
This spreadsheet filling and copying test runs more slowly than desired.

One problem that contributes to the slow performance is that
DOMRDFaParser is created too frequently, potentially for every cell,
during insertNode and setTextContent.

Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and 
XMLEventFactory, can reduce the run time to 36-40% of the current runtime on 
0.8.11 snapshot.

h4. DIAGNOSIS

Profiling the test (in NetBeans) reveals the filesystem is being accessed 
proportional to the number of cells.

A culprit appears to be that
org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node)
is called by
- org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode 
()
- 
org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode 
()
- 
org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent 
(String)

On every call updateInContentMetaDataCache calls
- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
which in turn calls
-- javax.xml.stream.XMLEventFactory.newInstance()
-- javax.xml.stream.XMLOutputFactory.newInstance()
-- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory, 
XMLEVentFactory, URIExtractor)
which in turn calls
--- org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory, 
XMLEVentFactory, URIExtractor)
which in turn calls
---- net.rootdev.javardfa.Parser(StatementSink)
which in turn AGAIN calls
----- javax.xml.stream.XMLEventFactory.newInstance()
----- javax.xml.stream.XMLOutputFactory.newInstance()

The newInstance methods are expensive because they access the file system 
looking for service providers.  Repeating such expensive calls for every cell 
is one reason this test case is slow.

h4. POSSIBLE FIXES:

h5. A. One approach is to address three opportunities to improve:

- 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and 
URIExtractor with its superclass javardfa.Parser, so that they are not 
duplicated within each parser.
- 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like 
URIExtractor, so they do not have to be recreated on every call that creates a 
SAXRDFaParser or DOMRDFaParser.
- 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do 
not have to be recreated on every call that needs an RDFa parser.

h5. B. Another approach is to do 1 and 2 without 3.  

  This will eliminate repeatedly constructing the expensive factories for each 
cell, but leaves the repeated construction of the parsers.

h5. C. A simpler approach is to only reuse DOMRDFaParser:

  This will greatly reduce the frequency of creating the factories (including 
the duplicate ones).  (SAXRDFaParser is only called by OdfFileDom in the 
initialize method, so it is not important to reuse.)

h4. POSSIBLE CRITERIA:

h5. Reducing computation:

  Rough figures from brief runs (not repeated runs):
  B reduces time to about 40% of the original.
  C reduces time to about 37% of the original.
  A reduces time to about 36% of the original.

  'A' saves more computation than 'B' or 'C'.

h5. Minimizing change:

  The RDFa parsers are all code that overrides library superclass 
net.rootdev.javardfa.Parser with some modifications.  The superclass was not 
designed for the modifications, so there is a large duplication of code.  To 
preserve the similarities between the superclass code and the copied subclass 
code, it may be best to minimize modifications to copied code in the subclass 
parsers.

  Approach 'C' meets this concern better than approach 'A' or 'B'.

h5. Safety of reusing parsers:

  Code inspection reveals fields:

  - sink: The primary result of the parser is to emitTriples, writing to the 
JenaSink that is already reused.

  - settings: The settings are currently not modified.  (In fact 
RDFaParser.settings is not modifiable: the superclass enable/disable methods do 
not write the subclass settings, so the settings in use would not be changed 
even if they were called.)

  - extractor: is already reused.

  - locator: is initialized by setBase, which is called before each parse in 
OdfFileDom.

  - context: also initialized by setBase, which is called before each parse in 
OdfFileDom.

  - literalCollector: contains a stack of states grows on start tags and 
shrinks on end tags.  If an exception is thrown for one parse, then it could be 
in an invalid state for the next parse.

  So reusing a parser looks safe as long as a patch resets the literalCollector 
when a reusable parser throws an exception.  Since LiteralCollector is a 
library class with private fields, its internal state is not accessible to be 
reset.  So the easiest approach may be to assume that parse exceptions occur 
rarely: simply forget the broken parser when an exception occurs, and recreate 
it the next time the parser is needed.

  Approach 'B' is safer than 'A' or 'C' because the parsers are still created 
for every cell.
  ----
(If these criteria are all important, I suggest approach 'C' unless problems 
are later uncovered, then back off to approach 'B' if they cannot be fixed.)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (ODFTOOLKIT-424) PERFORMANCE: Reuse DOMRDFaParser, currently created too frequently during insertNode and setTextContent.

Reply via email to