Nimarukan created ODFTOOLKIT-427:
------------------------------------
Summary: PERFORMANCE: Table getRowCount, getRowByIndex cache
improves getCellByPosition
Key: ODFTOOLKIT-427
URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-427
Project: ODF Toolkit
Issue Type: Improvement
Components: simple api
Environment: odfdom-java-0.8.11-incubating-SNAPSHOT
simple-odf-0.8.2-incubating-SNAPSHOT
jdk 1.8.0_77, win7
Reporter: Nimarukan
Priority: Minor
org.odftoolkit.simple.table
Table.getRowCount
is a top hotspot in a (NetBeans) sample profiling run of test_1 of
https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
Table.getCellByPosition is used to fill a cell in 5000 rows of this test.
Table.getCellByPosition calls getRowCount, which scans all rows to count them.
So filling one cell in _n_ rows with getCellByPosition runs in O(_n_^2^) time.
This can be avoided by caching the rowCount, so getRowCount runs in
O(_n_) time the first time and O(1) time thereafter as long as no rows
are inserted or removed (a common case when reading a spreadsheet).
On test_1 this eliminates getRowCount as a profiling hot spot; it barely
registers, taking less than 1% of its original time.
Table.getCellByPosition also calls getRowByIndex, which scans all rows
up to the row at index. (A row element in the xml may have a repeat
count, so the row at index i is not necessarily represented by the
i'th row element.) If getCellByPosition is used for multiple cells
in a row, this scan is currently repeated. It can partially be
avoided by caching the rows found, so getting the same row again for
another cell runs in O(1) time. On test_1 this reduces total time in
getRowByIndex by around 40%, but unlike many spreadsheets test_1 does
not fill many cells in one row, so impact may be lower than in other
programs.
The Simple API is a wrapper around the underlying ODF DOM (xml). The
cached values can be decached when the underlying DOM changes by
listening for MutationEvents. In particular, the rows and row count
can be invalidated when rows are added or removed, either by adding or
deleting row elements, or changing a number rows repeated attribute.
(Simply clearing the cache is conservative. An alternate approach
would be to change the cached row count by the number of rows removed,
and change the size of the row cache by the number of rows removed.
However, if many rows are being added or removed in sequence, then
repeatedly adjusting the size and indexes of the cache may add more
work than simply clearing it and recomputing it when reading begins
again. So for now the simple approach is to simply clear the cache.)
(Another option not taken is for getRowByIndex to cache all the rows
the first time the rows are scanned to find a row. Then getRowByIndex
would be more like getRowCount: O(_n_) the first time, and O(1)
thereafter as long as no rows were added or removed. However, if rows
*are* being added, then this worsens the O(_n_) cost for each row.
Users who want this performance can already get it by using
getRowIterator or getRowList rather than getRowIndex or
getCellByPosition.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)