[ 
https://issues.apache.org/jira/browse/ODFTOOLKIT-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nimarukan updated ODFTOOLKIT-424:
---------------------------------
    Attachment: Issue-424-patches.zip

Refreshed patch 3b (sorry, missed in earlier version).

> PERFORMANCE: Reuse DOMRDFaParser, currently created too frequently during 
> insertNode and setTextContent.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: ODFTOOLKIT-424
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-424
>             Project: ODF Toolkit
>          Issue Type: Improvement
>          Components: odfdom
>         Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77, 
> MSWin7
>            Reporter: Nimarukan
>            Priority: Minor
>              Labels: performance
>         Attachments: Issue-424-patches.zip
>
>
> https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
> This spreadsheet filling and copying test runs more slowly than desired.
> One problem that contributes to the slow performance is that
> DOMRDFaParser is created too frequently, potentially for every cell,
> during insertNode and setTextContent.
> Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and 
> XMLEventFactory, can reduce the run time to 36-40% of the current runtime on 
> 0.8.11 snapshot.
> h4. DIAGNOSIS
> Profiling the test (in NetBeans) reveals the filesystem is being accessed 
> proportional to the number of cells.
> A culprit appears to be that
> org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node)
> is called by
> - 
> org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode 
> ()
> - 
> org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode
>  ()
> - 
> org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent
>  (String)
> On every call updateInContentMetaDataCache calls
> - org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
> which in turn calls
> -- javax.xml.stream.XMLEventFactory.newInstance()
> -- javax.xml.stream.XMLOutputFactory.newInstance()
> -- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory, 
> XMLEVentFactory, URIExtractor)
> which in turn calls
> --- org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory, 
> XMLEVentFactory, URIExtractor)
> which in turn calls
> ---- net.rootdev.javardfa.Parser(StatementSink)
> which in turn AGAIN calls
> ----- javax.xml.stream.XMLEventFactory.newInstance()
> ----- javax.xml.stream.XMLOutputFactory.newInstance()
> The newInstance methods are expensive because they access the file system 
> looking for service providers.  Repeating such expensive calls for every cell 
> is one reason this test case is slow.
> h4. POSSIBLE FIXES:
> h5. A. One approach is to address three opportunities to improve:
> - 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and 
> URIExtractor with its superclass javardfa.Parser, so that they are not 
> duplicated within each parser.
> - 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like 
> URIExtractor, so they do not have to be recreated on every call that creates 
> a SAXRDFaParser or DOMRDFaParser.
> - 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do 
> not have to be recreated on every call that needs an RDFa parser.
> h5. B. Another approach is to do 1 and 2 without 3.  
>   This will eliminate repeatedly constructing the expensive factories for 
> each cell, but leaves the repeated construction of the parsers.
> h5. C. A simpler approach is to only reuse DOMRDFaParser:
>   This will greatly reduce the frequency of creating the factories (including 
> the duplicate ones).  (SAXRDFaParser is only called by OdfFileDom in the 
> initialize method, so it is not important to reuse.)
> h4. POSSIBLE CRITERIA:
> h5. Reducing computation:
>   Rough figures from brief runs (not repeated runs):
>   B reduces time to about 40% of the original.
>   C reduces time to about 37% of the original.
>   A reduces time to about 36% of the original.
>   'A' saves more computation than 'B' or 'C'.
> h5. Minimizing change:
>   The RDFa parsers are all code that overrides library superclass 
> net.rootdev.javardfa.Parser with some modifications.  The superclass was not 
> designed for the modifications, so there is a large duplication of code.  To 
> preserve the similarities between the superclass code and the copied subclass 
> code, it may be best to minimize modifications to copied code in the subclass 
> parsers.
>   Approach 'C' meets this concern better than approach 'A' or 'B'.
> h5. Safety of reusing parsers:
>   Code inspection reveals fields:
>   - sink: The primary result of the parser is to emitTriples, writing to the 
> JenaSink that is already reused.
>   - settings: The settings are currently not modified.  (In fact 
> RDFaParser.settings is not modifiable: the superclass enable/disable methods 
> do not write the subclass settings, so the settings in use would not be 
> changed even if they were called.)
>   - extractor: is already reused.
>   - locator: is initialized by setBase, which is called before each parse in 
> OdfFileDom.
>   - context: also initialized by setBase, which is called before each parse 
> in OdfFileDom.
>   - literalCollector: contains a stack of states grows on start tags and 
> shrinks on end tags.  If an exception is thrown for one parse, then it could 
> be in an invalid state for the next parse.
>   So reusing a parser looks safe as long as a patch resets the 
> literalCollector when a reusable parser throws an exception.  Since 
> LiteralCollector is a library class with private fields, its internal state 
> is not accessible to be reset.  So the easiest approach may be to assume that 
> parse exceptions occur rarely: simply forget the broken parser when an 
> exception occurs, and recreate it the next time the parser is needed.
>   Approach 'B' is safer than 'A' or 'C' because the parsers are still created 
> for every cell.
>   ----
> (If these criteria are all important, I suggest approach 'C' unless problems 
> are later uncovered, then back off to approach 'B' if they cannot be fixed.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to