Nimarukan created ODFTOOLKIT-434:
------------------------------------
Summary: PERFORMANCE/SPACE: Reduce memory per table cell
Key: ODFTOOLKIT-434
URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-434
Project: ODF Toolkit
Issue Type: Improvement
Components: odfdom
Affects Versions: 0.6.2-incubating
Environment: odfdom-java-0.8.11-incubating-SNAPSHOT,
simple-odf-0.8.2-incubating-SNAPSHOT, jdk1.8.0_79, MSWin7
Reporter: Nimarukan
Priority: Minor
h2. PERFORMANCE/SPACE: Reduce memory per table cell
ODFTOOLKIT-333 provides a [test
case|https://issues.apache.org/jira/secure/attachment/12806838/odftoolkit-333-test.zip]
with file bigFile.ods, which is 1.3MB in normal compressed form, or ~180MB
uncompressed.
Reading the file takes 1.5GB or so, which can cause a 64bit JVM with default
memory settings to run out of memory on a system with less than 6GB RAM
(assuming default -Xmx size is one quarter system RAM).
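The arithmetic behind that 6GB figure can be checked with a small sketch. The class and method names here ({{HeapCheck}}, {{defaultMaxHeapBytes}}) are made up for illustration, and the one-quarter rule is the assumption stated above, not something every JVM guarantees.

```java
final class HeapCheck {
    /** Estimated default max heap, ASSUMING the JVM defaults to one quarter of system RAM. */
    static long defaultMaxHeapBytes(long systemRamBytes) {
        return systemRamBytes / 4;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        // With 6GB of RAM the assumed default heap is 1.5GB, roughly what reading bigFile.ods needs.
        System.out.println("Assumed default heap on 6GB RAM: " + defaultMaxHeapBytes(6 * gib) + " bytes");
        // The real limit of the running JVM can be read directly.
        System.out.println("Max heap of this JVM: " + Runtime.getRuntime().maxMemory() + " bytes");
    }
}
```

Comparing the second printed value against about 1.5GB is a quick way to tell whether an explicit {{-Xmx}} setting is needed before loading the file.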
(I ran the test case using simple-odf-0.8.2-incubating-SNAPSHOT and
odfdom-java-0.8.11-incubating-SNAPSHOT from svn trunk, plus patches from
ODFTOOLKIT-424, approach A, which reduces initial runtime by a factor of 12 or
so over simpleapi 0.8.1 and odfdom 0.8.10.)
With the changes proposed below, the ODFTOOLKIT-333 test case runs in 25% less
time with unconstrained memory (java option {{-Xmx3000M}}). With less memory
than {{-Xmx2200M}}, the changes produce greater improvement because fewer
full-gc passes occur.
The changes:
* part1: Precompute OdfName qName
* part2: Use precomputed OdfName parts for table-cell element name, do not
store new ones.
* part3: Use precomputed OdfName parts for value-type attribute name, do not
store new ones.
* part4: Use OfficeValueTypeAttribute.Value for value-type attribute value, do
not store new ones.
* part5: Avoid creating an empty AttributeMap on p elements with no attributes.
These changes reduce the memory requirement by about 20% (1.5GB to 1.2GB).
Contents
- [Initial diagnosis|#InitialDiagnosis]
- [Reduce duplicate element name strings|#ReduceElementNameStrings]
- [Reduce duplicate attribute name strings|#ReduceAttributeNameStrings]
- [Reduce duplicate value type strings|#ReduceValueTypeStrings]
- [Reduce empty attribute maps|#ReduceEmptyAttributeMaps]
- [Table cell memory footprint|#TableCellMemoryFootprint]
-- [Users can further reduce memory|#UsersCanFurtherReduceMemory]
{anchor:InitialDiagnosis}
h3. INITIAL DIAGNOSIS
A heap dump taken during a profiled run (viewed in NetBeans) showed that the
classes with the most instances are:
{code}
6.7M char[]
6.7M String
2.7M org.apache.xerces.dom.AttributeMap
1.3M Object[]
1.3M Vector
1.3M org.apache.xerces.dom.TextImpl
1.3M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
1.3M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
1.3M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
47K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
47K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
So it looks like there were about 47K rows holding 1.3M cells.
But why so many Strings?
Each table cell is represented as elements such as:
{code}
<table:table-cell office:value-type="string"><text:p>Test data
47014</text:p></table:table-cell>
{code}
Browsing the latest {{String}} instances shows a large number of them are:
* element tag name parts like {{"table-cell"}} and {{"p"}},
* attribute name parts like {{"office"}} and {{"value-type"}}
* attribute values like {{"string"}},
* and the content string values in the cells, like {{"Test data 47021"}}.
{anchor:ReduceElementNameStrings}
h3. REDUCE DUPLICATE ELEMENT TAG NAME STRINGS
The element tag names {{"table-cell"}} and {{"p"}} should be shared, not
duplicated for every cell.
{panel}
1. {{TableTableCellElement}} defines a constant {{ELEMENT_NAME}} which is an
{{OdfName}}.
2. {{TableTableCellElement}} passes the {{OdfName}} to
{{TableTableCellElementBase}}.
3. {{TableTableCellElementBase}} passes the {{OdfName}} to
{{OdfStylableElement}}.
4. {{OdfStylableElement}} passes {{name.getURI()}} and {{name.getQName()}} to
{{OdfElement}}.
*CULPRIT 1*: {{OdfName.getQName()}} constructs a new string each time it
is called, concatenating the namespace prefix and the local name.
5. {{OdfElement}} passes the {{qName}} to {{xerces.dom.ElementNSImpl}}.
6. {{ElementNSImpl(ownerDoc, ns, qname)}} stores the prefix and local name.
*CULPRIT 2*: {{ElementNSImpl}} creates strings for the prefix and local
name, checks them, and stores the local name.
{panel}
To avoid creating strings for every element tag qname, prefix, and local name:
{panel:title=part1}
1. {{OdfName}} needs to precompute the qName.
{panel}
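As a minimal sketch of what part1 means, the following stand-in class computes the qualified name once in the constructor and returns the same {{String}} on every call. {{CachedName}} is a hypothetical name used only for illustration; it is not the actual {{OdfName}} source.

```java
final class CachedName {
    private final String uri;
    private final String prefix;
    private final String localName;
    private final String qName; // computed once, shared by every caller

    CachedName(String uri, String prefix, String localName) {
        this.uri = uri;
        this.prefix = prefix;
        this.localName = localName;
        // Concatenate here, not in getQName(), so no new String is built per call.
        this.qName = (prefix == null || prefix.isEmpty())
                ? localName
                : prefix + ":" + localName;
    }

    String getUri() { return uri; }
    String getPrefix() { return prefix; }
    String getLocalName() { return localName; }

    /** Returns the precomputed qualified name; never allocates. */
    String getQName() { return qName; }
}
```

With one such name object per element type held in a constant, a million cells reference one qName string instead of creating a million equal copies.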
{panel:title=part2}
4. {{OdfStylableElement(ownerDoc, OdfName, ...)}}
must call {{OdfElement(ownerDoc, OdfName)}}
\[not {{OdfElement(ownerDoc, ns, qname)}}]
5. {{OdfElement(ownerDoc, OdfName)}}
must call {{ElementNSImpl(ownerDoc, ns, qname, localName)}}
\[not {{ElementNSImpl(ownerDoc, ns, qname)}}]
{panel}
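The effect of choosing the four-argument constructor can be sketched with a simplified stand-in. {{SketchElement}} below is hypothetical and omits {{ownerDoc}}; it only mimics the behaviour described above (deriving the local name from the qName versus accepting a precomputed one) and is not the Xerces {{ElementNSImpl}} code.

```java
final class SketchElement {
    final String namespaceUri;
    final String qName;
    final String localName;

    /** Mimics the (ns, qname) form: the local name is cut out of the qName for each element. */
    SketchElement(String namespaceUri, String qName) {
        this.namespaceUri = namespaceUri;
        this.qName = qName;
        int colon = qName.indexOf(':');
        // substring() yields a separate String object for every element constructed this way.
        this.localName = colon < 0 ? qName : qName.substring(colon + 1);
    }

    /** Mimics the (ns, qname, localName) form: the caller supplies a shared local name. */
    SketchElement(String namespaceUri, String qName, String localName) {
        this.namespaceUri = namespaceUri;
        this.qName = qName;
        this.localName = localName; // stored as-is, so all elements share one instance
    }
}
```

In this sketch, elements built with the second constructor from the same precomputed parts all point at the same three strings, which is the sharing that part2 is after.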
After this change a profile run showed the following:
{code}
4.8M char[]
4.8M String
3.1M org.apache.xerces.dom.AttributeMap
1.6M Object[]
1.6M Vector
1.5M org.apache.xerces.dom.TextImpl
1.5M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
1.5M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
1.5M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
55K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
55K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
(The row and cell counts are larger because this snapshot was taken later in the run.)
Browsing the latest {{String}} instances shows the element names
{{"table-cell"}} and {{"p"}} are no longer frequent.
{anchor:ReduceAttributeNameStrings}
h3. REDUCE DUPLICATE ATTRIBUTE NAME STRINGS
A large number of remaining strings are attribute parts like {{"office"}},
{{"value-type"}}, {{"string"}}, plus the test string values in the cells, like
{{"Test data 47021"}}.
Attribute name parts like {{"office"}} and {{"value-type"}} should be shared,
not duplicated for every cell.
{panel}
*CULPRIT 3*: {{AttrNSImpl(ownerDoc, ns, qName)}} creates strings for the
prefix and local name, checks them, and stores the local name.
{panel}
To share the attribute name strings, a similar change is needed:
{panel:title=part3}
1. {{OdfAttribute(ownerDoc, OdfName)}}
must call {{AttrNSImpl(ownerDoc, ns, qName, localName)}}
\[not {{AttrNSImpl(ownerDoc, ns, qName)}}]
{panel}
After adding this change a profile run showed the following:
{code}
3.4M char[]
3.4M String
3.4M org.apache.xerces.dom.AttributeMap
1.7M Object[]
1.7M Vector
1.6M org.apache.xerces.dom.TextImpl
1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
60K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
60K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
Browsing the latest instances shows {{"office"}} and {{"value-type"}} are no
longer frequent.
{anchor:ReduceValueTypeStrings}
h3. REDUCE DUPLICATE VALUE TYPE STRINGS
The {{value-type}} attribute value {{"string"}} is duplicated for each cell.
To share {{value-type}} attribute value strings, such as {{"string"}} in
{{office:value-type="string"}}, do not store the string from the input.
Instead, use the value to find the enum {{OfficeValueTypeAttribute.Value}}.
{panel:title=part4}
1. {{OfficeValueTypeAttribute_setAttribute(stringValue)}}
Find the enum value with
{{OfficeValueTypeAttribute.Value.enumValueOf(stringValue)}}.
If it is not null, use its string instead of {{stringValue}}.
{panel}
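The lookup in part4 amounts to canonicalizing the parsed value through an enum. The sketch below uses a hypothetical {{ValueType}} enum and {{ValueCanonicalizer}} helper as stand-ins for {{OfficeValueTypeAttribute.Value}}; the list of constants is an assumption for illustration and the real enum may differ.

```java
enum ValueType {
    BOOLEAN("boolean"), CURRENCY("currency"), DATE("date"), FLOAT("float"),
    PERCENTAGE("percentage"), STRING("string"), TIME("time"), VOID("void");

    private final String text; // one shared String per constant

    ValueType(String text) {
        this.text = text;
    }

    /** Returns the matching constant, or null if the text is not a known value type. */
    static ValueType enumValueOf(String text) {
        for (ValueType v : values()) {
            if (v.text.equals(text)) {
                return v;
            }
        }
        return null;
    }

    @Override
    public String toString() {
        return text;
    }
}

final class ValueCanonicalizer {
    /** Swaps a freshly parsed value for the enum's shared string when one exists. */
    static String canonical(String parsed) {
        ValueType v = ValueType.enumValueOf(parsed);
        return v != null ? v.toString() : parsed;
    }
}
```

Unknown values fall through unchanged, so the attribute still round-trips input that the enum does not cover.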
After adding this change, a profile run showed the following:
{code}
3.3M org.apache.xerces.dom.AttributeMap
1.7M char[]
1.7M String
1.7M Object[]
1.7M Vector
1.6M org.apache.xerces.dom.TextImpl
1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
Much better: the number of strings is now close to the number of cells.
{anchor:ReduceEmptyAttributeMaps}
h3. REDUCE EMPTY ATTRIBUTE MAPS
However, the number of {{AttributeMap}} instances is still too high, about
twice the number of cells. Browsing instances of {{AttributeMap}} reveals that
each cell has two elements: a {{"table-cell"}} element and a {{"p"}}
(paragraph) element.
{code}
<table:table-cell office:value-type="string"><text:p>Test data
47014</text:p></table:table-cell>
{code}
Only the {{"table-cell"}} elements have an attribute
({{office:value-type="string"}}); the {{"p"}} elements have none.
An empty {{AttributeMap}} may be created and stored in an {{Element}} if xerces
{{ElementImpl.getAttributes()}} is called when there are no attributes. To
avoid this, a caller should check {{Element.hasAttributes()}} first and call
{{Element.getAttributes()}} only if it returns true.
Setting a breakpoint on {{ElementImpl.getAttributes()}} reveals that
{{odfdom.pkg.rdfa.DOMRDFaParser}} is the culprit. To eliminate the creation of
empty {{AttributeMap}}:
{panel:title=part5}
1. Change {{DOMRDFaParser.process}} to check {{Element.hasAttributes()}}
first. If it returns false, do not call {{Element.getAttributes()}};
instead, use a static {{EmptyAttributes}} object.
{panel}
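The guard in part5 can be sketched against the standard {{org.w3c.dom}} interfaces. {{AttributeAccess}}, {{EMPTY}} and {{attributesOf}} are hypothetical names for illustration; this is not the actual {{DOMRDFaParser}} change or its {{EmptyAttributes}} class.

```java
import org.w3c.dom.DOMException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

final class AttributeAccess {
    /** One shared, read-only, always-empty map used for elements without attributes. */
    static final NamedNodeMap EMPTY = new NamedNodeMap() {
        public Node getNamedItem(String name) { return null; }
        public Node getNamedItemNS(String namespaceURI, String localName) { return null; }
        public Node item(int index) { return null; }
        public int getLength() { return 0; }
        public Node setNamedItem(Node arg) {
            throw new DOMException(DOMException.NO_MODIFICATION_ALLOWED_ERR, "read-only");
        }
        public Node setNamedItemNS(Node arg) {
            throw new DOMException(DOMException.NO_MODIFICATION_ALLOWED_ERR, "read-only");
        }
        public Node removeNamedItem(String name) {
            throw new DOMException(DOMException.NOT_FOUND_ERR, "map is empty");
        }
        public Node removeNamedItemNS(String namespaceURI, String localName) {
            throw new DOMException(DOMException.NOT_FOUND_ERR, "map is empty");
        }
    };

    /** Calls getAttributes() only when the element actually has attributes. */
    static NamedNodeMap attributesOf(Element element) {
        return element.hasAttributes() ? element.getAttributes() : EMPTY;
    }
}
```

Callers that only read attributes can iterate the returned map as before; an element without attributes simply yields a map of length zero without a per-element map being allocated.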
With this change, a heap dump during a profile run shows:
{code}
1.7M char[]
1.7M String
1.7M Object[]
1.7M Vector
1.6M org.apache.xerces.dom.AttributeMap
1.6M org.apache.xerces.dom.TextImpl
1.6M org.odftoolkit.odfdom.dom.element.table.TableTableCellElement
1.6M org.odftoolkit.odfdom.dom.attribute.office.OfficeValueTypeAttribute
1.6M org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph
58K org.odftoolkit.odfdom.dom.attribute.table.TableStyleNameAttribute
58K org.odftoolkit.odfdom.dom.element.table.TableTableRowElement
{code}
Now the number of {{AttributeMap}} matches the number of cells.
{anchor:TableCellMemoryFootprint}
h3. TABLE-CELL MEMORY FOOTPRINT
The test case file has cells represented as follows:
{code}
<table:table-cell office:value-type="string"><text:p>Test data
47014</text:p></table:table-cell>
{code}
After these patches, all the strings are shared by many cells, except the
content strings like {{"Test data 47014"}}. So the memory footprint of each
cell is as follows:
{code}
- (17+2fields) Element "table-cell" (TableTableCellElement)
- ( 4+2fields) Element "table-cell" AttributeMap
- ( 4+2fields) Element "table-cell" AttributeMap Vector
- ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
(4 array slots are null, and could be reclaimed in
theory, but the vector is not public so not easy.)
- ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
- (17+2fields) Element "p" (OdfTextParagraph)
- ( 5+2fields) TextImpl
- ( 2+2fields) String
~ (15 char) char array "Test data 47014"
____________
~61 fields + 9 * 2 (for object headers) + data
is about 80 words of memory.
or about 320 bytes (4-byte words in 32bit-JVM)
or about 640 bytes (8-byte words in 64bit-JVM)
{code}
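The totals in that list can be re-added mechanically. This sketch only repeats the arithmetic with the per-object field counts copied from the list; {{CellFootprint}} and its method names are hypothetical, and the rounding to 80 words (to cover the character data) follows the estimate above.

```java
final class CellFootprint {
    // Field counts per object, copied from the list above (header words not included).
    static final int[] FIELDS = {17, 4, 4, 5, 7, 17, 5, 2};
    // The eight objects above plus the char array.
    static final int OBJECTS = 9;
    static final int HEADER_WORDS_PER_OBJECT = 2;

    /** Sum of the listed field counts. */
    static int fieldWords() {
        int sum = 0;
        for (int f : FIELDS) {
            sum += f;
        }
        return sum;
    }

    /** Fields plus object headers, before adding the character data itself. */
    static int wordsBeforeData() {
        return fieldWords() + OBJECTS * HEADER_WORDS_PER_OBJECT;
    }

    static int bytes(int words, int bytesPerWord) {
        return words * bytesPerWord;
    }

    public static void main(String[] args) {
        System.out.println("fields: " + fieldWords());
        System.out.println("fields + headers: " + wordsBeforeData());
        // Rounded up to about 80 words once the character data is included.
        System.out.println("32-bit bytes: " + bytes(80, 4));
        System.out.println("64-bit bytes: " + bytes(80, 8));
    }
}
```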
As noted, especially for large data spreadsheets, the full literal DOM tree is
not a space-efficient representation, so it requires the JVM to have access to
plenty of memory. The JVM default maximum memory is often one quarter of
system RAM, so specifying a larger {{java -Xmx}} value may be required if the
default is too small.
{anchor:UsersCanFurtherReduceMemory}
{panel:title=Users can further reduce memory footprint of this file.}
In this file, the cell values are unformatted strings, so they could
alternatively be stored using an attribute rather than a nested paragraph.
{code}
<table:table-cell office:value-type="string" office:string-value="Test data
47014"/>
{code}
This is longer xml text, and does not compress as well for some reason, so the
file is larger on disk.
But in memory, this removes the large {{text:p}} element as well as the
{{TextImpl}} object, and adds an {{office:string-value}} attribute. With
this reduced xml, each cell has the following object sizes:
{code}
- (17+2fields) Element "table-cell" (TableTableCellElement)
- ( 4+2fields) Element "table-cell" AttributeMap
- ( 4+2fields) Element "table-cell" AttributeMap Vector
- ( 5+2fields) Element "table-cell" AttributeMap Vector Object array
(4 elements are null, so 32 B could be reclaimed in
theory, but the vector is not public so not easy.)
- ( 7+2fields) Element "table-cell" Attr "office:value-type='string'"
- ( 7+2fields) Element "table-cell" Attr "office:string-value='Test data 47014'"
- ( 2+2fields) String
~ (15 char) char array "Test data 47014"
______________
~46 fields + 8 * 2 (for object headers) + data
is about 62 words of memory.
or about 248 bytes (4-byte words in 32bit-JVM)
or about 496 bytes (8-byte words in 64bit-JVM)
{code}
Even though the file is longer, this ~20% reduction in memory can reduce the
runtime of the test case by 20%.
{panel}