Nimarukan created ODFTOOLKIT-434:
------------------------------------

             Summary: PERFORMANCE/SPACE: Reduce memory per table cell
                 Key: ODFTOOLKIT-434
                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-434
             Project: ODF Toolkit
          Issue Type: Improvement
          Components: odfdom
    Affects Versions: 0.6.2-incubating
         Environment: odfdom-java-0.8.11-incubating-SNAPSHOT, 
simple-odf-0.8.2-incubating-SNAPSHOT, jdk1.8.0_79, MSWin7
            Reporter: Nimarukan
            Priority: Minor


h2. PERFORMANCE/SPACE: Reduce memory per table cell

ODFTOOLKIT-333 provides a [test 
case|https://issues.apache.org/jira/secure/attachment/12806838/odftoolkit-333-test.zip]
 with file bigFile.ods, which is 1.3MB in normal compressed form, or ~180MB 
uncompressed.

Reading the file takes about 1.5GB of heap, which can cause a 64bit JVM with 
default memory settings to run out of memory on a system with less than 6GB RAM 
(assuming default -Xmx size is one quarter system RAM). 
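
To check which heap ceiling a given JVM actually applies, a small sketch like the following can help. The class and method names are illustrative only, and the one-quarter rule is the same assumption stated above, not something the code verifies.

```java
// Illustrative helper (hypothetical names): reports the heap ceiling of the running JVM.
final class HeapCeiling {
    /** Maximum heap in MiB as reported by the JVM (reflects -Xmx or the JVM default). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Default ceiling under the assumed one-quarter-of-system-RAM rule. */
    static long assumedDefaultMiB(long systemRamMiB) {
        return systemRamMiB / 4;
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
        // Under the assumption, 6144 MiB of RAM yields a 1536 MiB (1.5GB) default.
        System.out.println("assumed default for 6144 MiB RAM: "
                + assumedDefaultMiB(6144) + " MiB");
    }
}
```

Running it with and without an explicit {{-Xmx}} option shows whether the default on a given machine is below the ~1.5GB this file needs.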

(I ran the test case using simple-odf-0.8.2-incubating-SNAPSHOT and 
odfdom-java-0.8.11-incubating-SNAPSHOT from svn trunk, plus patches from 
ODFTOOLKIT-424, approach A, which reduces initial runtime by a factor of 12 or 
so over simpleapi 0.8.1 and odfdom 0.8.10.)

With the changes proposed below, the ODFTOOLKIT-333 test case runs in 25% less 
time with unconstrained memory (java option {{-Xmx3000M}}).  With less memory 
than {{-Xmx2200M}}, the changes produce greater improvement because fewer 
full-gc passes occur.

The changes:
* part1: Precompute OdfName qName
* part2: Use precomputed OdfName parts for table-cell element name, do not 
store new ones.
* part3: Use precomputed OdfName parts for value-type attribute name, do not 
store new ones.
* part4: Use OfficeValueTypeAttribute.Value for value-type attribute value, do 
not store new ones.
* part5: Avoid creating an empty AttributeMap on p elements with no attributes.

These changes reduce the memory requirement by about 20% (1.5GB to 1.2GB).

Contents
- [Initial diagnosis|#InitialDiagnosis]
- [Reduce duplicate element name strings|#ReduceElementNameStrings]
- [Reduce duplicate attribute name strings|#ReduceAttributeNameStrings]
- [Reduce duplicate value type strings|#ReduceValueTypeStrings]
- [Reduce empty attribute maps|#ReduceEmptyAttributeMaps]
- [Table cell memory footprint|#TableCellMemoryFootprint]
-- [Users can further reduce memory|#UsersCanFurtherReduceMemory]

{anchor:InitialDiagnosis}
h3. INITIAL DIAGNOSIS

A heap dump during a profiled run showed (in NetBeans) that the classes with 
the most instances are:

{code}
  6.7M char[]
  6.7M String
  2.7M org.apache.xerces.dom.AttributeMap
  1.3M Object[]
  1.3M Vector
  1.3M org.apache.xerces.dom.TextImpl
  1.3M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
  1.3M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
  1.3M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
  47K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
  47K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}

So it looks like there were about 47K rows holding 1.3M cells.

But why so many Strings?

Each table cell is represented as elements such as:
{code}
<table:table-cell office:value-type="string"><text:p>Test data 
47014</text:p></table:table-cell>
{code}

Browsing the latest {{String}} instances shows a large number of them are:
* element tag name parts like {{"table-cell"}} and {{"p"}},
* attribute name parts like {{"office"}} and {{"value-type"}},
* attribute values like {{"string"}},
* and the content string values in the cells, like {{"Test data 47021"}}.

{anchor:ReduceElementNameStrings}
h3. REDUCE DUPLICATE ELEMENT TAG NAME STRINGS

The element tag names {{"table-cell"}} and {{"p"}} should be shared, not 
duplicated for every cell.

{panel}
  1. {{TableTableCellElement}} defines a constant {{ELEMENT_NAME}} which is an 
{{OdfName}}.
  2. {{TableTableCellElement}} passes the {{OdfName}} to 
{{TableTableCellElementBase}}.
  3. {{TableTableCellElementBase}} passes the {{OdfName}} to 
{{OdfStylableElement}}.
  4. {{OdfStylableElement}} passes {{name.getURI()}} and {{name.getQName()}} to 
{{OdfElement}}.

     *CULPRIT 1*: {{OdfName.getQName()}} constructs a new string each time it 
is called, concatenating the namespace prefix and the local name.

  5. {{OdfElement}} passes the {{qName}} to {{xerces.dom.ElementNSImpl}}.
  6. {{ElementNSImpl(ownerDoc, ns, qname)}} stores the prefix and local name.

     *CULPRIT 2*: {{ElementNSImpl}} creates strings for the prefix and local 
name, checks them, and stores the local name.
{panel}

To avoid creating strings for every element tag qname, prefix, and local name:

{panel:title=part1}
  1. {{OdfName}} needs to precompute the qName.
{panel}
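
As a minimal sketch of what precomputing the qName means, the following stand-in class builds the qualified name once in its constructor and returns the same {{String}} instance on every call. {{NameSketch}} and its fields are hypothetical and are not the actual {{OdfName}} source.

```java
/** Hypothetical stand-in for a qualified name; not the actual OdfName code. */
final class NameSketch {
    private final String uri;
    private final String prefix;
    private final String localName;
    private final String qName;

    NameSketch(String uri, String prefix, String localName) {
        this.uri = uri;
        this.prefix = prefix;
        this.localName = localName;
        // Concatenate once here instead of on every getQName() call.
        this.qName = prefix.isEmpty() ? localName : prefix + ":" + localName;
    }

    String getUri() { return uri; }
    String getPrefix() { return prefix; }
    String getLocalName() { return localName; }

    /** Returns the precomputed string; no allocation per call. */
    String getQName() { return qName; }
}
```

Because the name object is a shared constant, every element constructed from it can then refer to the same qName string.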

{panel:title=part2}
  4. {{OdfStylableElement(ownerDoc, OdfName, ...)}}
     must call {{OdfElement(ownerDoc, OdfName)}}
     \[not {{OdfElement(ownerDoc, ns, qname)}}]
  5. {{OdfElement(ownerDoc, OdfName)}}
     must call {{ElementNSImpl(ownerDoc, ns, qname, localName)}}
     \[not {{ElementNSImpl(ownerDoc, ns, qname)}}]
{panel}
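
The difference between the two constructor styles can be illustrated with a stand-in class. This is only a sketch: {{ElementSketch}} is hypothetical, omits the owner document, and does not reproduce the Xerces implementation. It shows only why deriving the local name from the qName allocates a string per element, while passing it in does not.

```java
/** Hypothetical stand-in for an element node; not the Xerces ElementNSImpl code. */
final class ElementSketch {
    final String namespaceUri;
    final String qName;
    final String localName;

    /** Style without localName: derives it, allocating a new String per element. */
    ElementSketch(String namespaceUri, String qName) {
        this(namespaceUri, qName, qName.substring(qName.indexOf(':') + 1));
    }

    /** Style with localName: stores the caller's (shared) strings as given. */
    ElementSketch(String namespaceUri, String qName, String localName) {
        this.namespaceUri = namespaceUri;
        this.qName = qName;
        this.localName = localName;
    }

    public static void main(String[] args) {
        String ns = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
        String qName = "table:table-cell";
        String local = "table-cell";
        ElementSketch derived = new ElementSketch(ns, qName);
        ElementSketch shared = new ElementSketch(ns, qName, local);
        System.out.println("derived shares localName: " + (derived.localName == local));
        System.out.println("shared shares localName:  " + (shared.localName == local));
    }
}
```

With 1.3M or more cells, each holding a table-cell and a p element, the derived form accounts for millions of short duplicate strings.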


After this change a profile run showed the following:

{code}
  4.8M char[]
  4.8M String
  3.1M org.apache.xerces.dom.AttributeMap
  1.6M Object[]
  1.6M Vector
  1.5M org.apache.xerces.dom.TextImpl
  1.5M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
  1.5M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
  1.5M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
  55K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
  55K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
(Counts are larger because this snapshot was taken later in the run.)

Browsing the latest {{String}} instances shows the element names 
{{"table-cell"}} and {{"p"}} are no longer frequent.


{anchor:ReduceAttributeNameStrings}
h3. REDUCE DUPLICATE ATTRIBUTE NAME STRINGS

A large number of remaining strings are attribute parts like {{"office"}}, 
{{"value-type"}}, {{"string"}}, plus the test string values in the cells, like 
{{"Test data 47021"}}.

Attribute name parts like {{"office"}} and {{"value-type"}} should be shared, 
not duplicated for every cell.

{panel}
     *CULPRIT 3*: {{AttrNSImpl(ownerDoc, ns, qName)}} creates strings for the 
prefix and local name, checks them, and stores the local name.
{panel}

To share the attribute name strings, a similar change is needed:

{panel:title=part3}
  1. {{OdfAttribute(ownerDoc, OdfName)}}
    must call {{AttrNSImpl(ownerDoc, ns, qName, localName)}}
    \[not {{AttrNSImpl(ownerDoc, ns, qName)}}]
{panel}

After adding this change a profile run showed the following:

{code}
  3.4M char[]
  3.4M String
  3.4M org.apache.xerces.dom.AttributeMap
  1.7M Object[]
  1.7M Vector
  1.6M org.apache.xerces.dom.TextImpl
  1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
  1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
  1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
  60K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
  60K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}

Browsing the latest instances shows {{"office"}} and {{"value-type"}} are no 
longer frequent.

{anchor:ReduceValueTypeStrings}
h3. REDUCE DUPLICATE VALUE TYPE STRINGS

The {{value-type}} attribute value {{"string"}} is duplicated for each cell.

To share {{value-type}} attribute value strings, such as {{"string"}} in 
{{office:value-type="string"}}, do not store the string from the input.
Instead, use the value to find the enum {{OfficeValueTypeAttribute.Value}}.

{panel:title=part4}
1. OfficeValueTypeAttribute_setAttribute(stringValue)
   Find enum value with
   OfficeValueTypeAttribute.Value.enumValueOf(stringValue)
   If not null, use its string instead of the stringValue.
{panel}
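
The idea in part4 can be sketched with a stand-in enum. {{ValueTypeSketch}} and {{canonicalize}} are hypothetical names used for illustration; only the {{enumValueOf}} lookup and the fall-back to the input string follow the description above.

```java
/** Hypothetical stand-in for OfficeValueTypeAttribute.Value (subset of values). */
enum ValueTypeSketch {
    BOOLEAN("boolean"), DATE("date"), FLOAT("float"), STRING("string");

    private final String text;

    ValueTypeSketch(String text) { this.text = text; }

    @Override
    public String toString() { return text; }

    /** Returns the matching constant, or null if the string is not a known value. */
    static ValueTypeSketch enumValueOf(String s) {
        for (ValueTypeSketch v : values()) {
            if (v.text.equals(s)) {
                return v;
            }
        }
        return null;
    }

    /** Swaps a freshly parsed string for the enum's shared string when possible. */
    static String canonicalize(String parsed) {
        ValueTypeSketch v = enumValueOf(parsed);
        return v != null ? v.toString() : parsed;
    }
}
```

A string the enum does not recognize is kept as is, so unexpected input is still stored unchanged.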

After adding this change, a profile run showed the following:

{code}
  3.3M org.apache.xerces.dom.AttributeMap
  1.7M char[]
  1.7M String
  1.7M Object[]
  1.7M Vector
  1.6M org.apache.xerces.dom.TextImpl
  1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
  1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
  1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
  58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
  58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}

Much better: the number of strings is now near the number of cells.

{anchor:ReduceEmptyAttributeMaps}
h3. REDUCE EMPTY ATTRIBUTE MAPS

However, the number of {{AttributeMap}} instances is still too high, about 
twice the number of cells.  Browsing instances of {{AttributeMap}} reveals that 
each cell has two elements: a {{"table-cell"}} element and a {{"p"}} 
(paragraph) element.
{code}
<table:table-cell office:value-type="string"><text:p>Test data 
47014</text:p></table:table-cell>
{code}
Only the {{"table-cell"}} elements have an attribute 
({{office:value-type="string"}}); the {{"p"}} elements have no attributes.

An empty {{AttributeMap}} may be created and stored in an {{Element}} if xerces 
{{ElementImpl.getAttributes()}} is called when there are no attributes.  To 
avoid this, a caller should first check {{Element.hasAttributes()}} and only 
call {{Element.getAttributes()}} if it returns true.

Setting a breakpoint on {{ElementImpl.getAttributes()}} reveals that 
{{odfdom.pkg.rdfa.DOMRDFaParser}} is the culprit.  To eliminate the creation of 
empty {{AttributeMap}}:

{panel:title=part5}
1. Change DOMRDFaParser.process to check whether an
   Element.hasAttributes().  If not, do not call
   Element.getAttributes(); instead, use a static EmptyAttributes
   object.
{panel}
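
The guard can be illustrated against the standard {{org.w3c.dom}} API. The sketch below is not the {{DOMRDFaParser}} code: {{AttributeWalk}} is a hypothetical class that walks a tree and only calls {{getAttributes()}} on elements that report attributes. It skips attribute-free elements rather than substituting an empty-attributes object.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical helper showing the hasAttributes() guard; not DOMRDFaParser. */
final class AttributeWalk {
    /** Counts attributes in the subtree, calling getAttributes() only when needed. */
    static int countAttributes(Element element) {
        int count = 0;
        if (element.hasAttributes()) {
            NamedNodeMap attrs = element.getAttributes();
            count += attrs.getLength();
        }
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child instanceof Element) {
                count += countAttributes((Element) child);
            }
        }
        return count;
    }

    /** Parses with the default (non-namespace-aware) factory, so prefixes need no xmlns. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample xml", e);
        }
    }

    public static void main(String[] args) {
        String cell = "<table:table-cell office:value-type=\"string\">"
                + "<text:p>Test data 47014</text:p></table:table-cell>";
        System.out.println(countAttributes(parse(cell).getDocumentElement()));
    }
}
```

For the sample cell, only the table-cell element reaches {{getAttributes()}}; the p element is skipped by the guard.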

With this change, a heap dump during a profile run shows:

{code}
  1.7M char[]
  1.7M String
  1.7M Object[]
  1.7M Vector
  1.6M org.apache.xerces.dom.AttributeMap
  1.6M org.apache.xerces.dom.TextImpl
  1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
  1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
  1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
  58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
  58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}

Now the number of {{AttributeMap}} instances matches the number of cells.

{anchor:TableCellMemoryFootprint}
h3. TABLE-CELL MEMORY FOOTPRINT

The test case file has cells represented as follows:
{code}
<table:table-cell office:value-type="string"><text:p>Test data 
47014</text:p></table:table-cell>
{code}
After these patches, all the strings are shared by many cells, except the 
content strings like "Test data 47014".  So the per-cell memory footprint is as 
follows:

{code}
- (17+2fields) Element "table-cell" (TableTableCellElement)
- ( 4+2fields) Element "table-cell" AttributeMap
- ( 4+2fields) Element "table-cell" AttributeMap Vector
- ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
                      (4 array slots are null, and could be reclaimed in
                       theory, but the vector is not public so not easy.)
- ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
- (17+2fields) Element "p" (OdfTextParagraph)
- ( 5+2fields) TextImpl
- ( 2+2fields) String
~ (15 char) char array "Test data 47014"
____________
 ~61 fields + 9 * 2 (for object headers) + data
 is about  80 words of memory.
 or about 320 bytes (4-byte words in 32bit-JVM)
 or about 640 bytes (8-byte words in 64bit-JVM)
{code}
As noted, especially for large data spreadsheets, the full literal DOM tree is 
not a space-efficient representation, so it requires the JVM to have access to 
plenty of memory.  The JVM default maximum memory is often one quarter of 
system RAM, so specifying a larger {{java -Xmx}} value may be required if the 
default is too small.
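
The arithmetic of the tally above can be reproduced with a short sketch. The class and method names are hypothetical. The character data is left out, as are alignment and per-row overhead, so the results are rough lower bounds rather than measurements. The 1.6M cell count is taken from the heap dumps above.

```java
/** Hypothetical helper reproducing the per-cell tally; excludes character data. */
final class CellFootprint {
    /** Sum of field words plus a fixed number of header words per object. */
    static int words(int headerWordsPerObject, int objectCount, int... fieldCounts) {
        int fields = 0;
        for (int f : fieldCounts) {
            fields += f;
        }
        return fields + headerWordsPerObject * objectCount;
    }

    /** Rough total in bytes for a number of cells at a given word size. */
    static long totalBytes(long cells, int wordsPerCell, int bytesPerWord) {
        return cells * wordsPerCell * bytesPerWord;
    }

    public static void main(String[] args) {
        // Field counts from the tally: 8 objects with fields plus the char array = 9 headers.
        int w = words(2, 9, 17, 4, 4, 5, 7, 17, 5, 2);
        System.out.println(w + " words per cell");               // 79
        System.out.println(w * 4 + " bytes with 4-byte words");  // 316
        System.out.println(w * 8 + " bytes with 8-byte words");  // 632
        System.out.println(totalBytes(1_600_000L, w, 8) + " bytes for 1.6M cells");
    }
}
```

At 8-byte words, 1.6M cells come to roughly 1.0GB before character data, which is in line with the ~1.2GB observed after the patches.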

{anchor:UsersCanFurtherReduceMemory}
{panel:title=Users can further reduce memory footprint of this file.}
In this file, the cell values are unformatted strings, so they could 
alternatively be stored using an attribute rather than a nested paragraph.
{code}
<table:table-cell office:value-type="string" office:string-value="Test data 
47014"/>
{code}
This is longer xml text, and does not compress as well for some reason, so the 
file is larger on disk.

But in memory, this removes the large {{text:p}} element as well as the 
{{TextImpl}} object, and adds an {{office:string-value}} attribute.  With 
this reduced xml, each cell has the following object sizes:
{code}
- (17+2fields) Element "table-cell" (TableTableCellElement)
- ( 4+2fields) Element "table-cell" AttributeMap
- ( 4+2fields) Element "table-cell" AttributeMap Vector
- ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
                      (4 elements are null, so 32 B could be reclaimed in
                       theory, but the vector is not public so not easy.)
- ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
- ( 7+2fields) Element "table-cell" Attr "office:string-value='Test data 47014'"
- ( 2+2fields) String
~ (15 char) char array "Test data 47014"
______________
 ~46 fields + 8 * 2 (for object headers) + data
 is about  62 words of memory.
 or about 248 bytes (4-byte words in 32bit-JVM)
 or about 496 bytes (8-byte words in 64bit-JVM)
{code}
Even though the file is longer, this ~20% reduction in memory can reduce the 
runtime of the test case by 20%.
{panel}




