[
https://issues.apache.org/jira/browse/ODFTOOLKIT-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nimarukan updated ODFTOOLKIT-427:
---------------------------------
Attachment: Issue-427-patches.zip
(patch -l -p 1 -i file.patch)
427-part1-simple-Table-cacheRowCountAndColCount.patch
427-part2-simple-Table-cacheRowByIndex.patch
(result should preserve Table.java's unix line ends '\n', not '\r\n')
> PERFORMANCE: Table getRowCount, getRowByIndex cache improves getCellByPosition
> ------------------------------------------------------------------------------
>
> Key: ODFTOOLKIT-427
> URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-427
> Project: ODF Toolkit
> Issue Type: Improvement
> Components: simple api
> Environment: odfdom-java-0.8.11-incubating-SNAPSHOT
> simple-odf-0.8.2-incubating-SNAPSHOT
> jdk 1.8.0_77, win7
> Reporter: Nimarukan
> Priority: Minor
> Labels: performance
> Attachments: Issue-427-patches.zip
>
>
> org.odftoolkit.simple.table
> Table.getRowCount
> is a top hotspot in a (NetBeans) sample profiling run of test_1 of
> https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
> Table.getCellByPosition is used to fill a cell in 5000 rows of this test.
> Table.getCellByPosition calls getRowCount, which scans all rows to count them.
> So filling one cell in _n_ rows with getCellByPosition runs in O(_n_^2^) time.
> This can be avoided by caching the rowCount, so getRowCount runs in
> O(_n_) time the first time and O(1) time thereafter as long as no rows
> are inserted or removed (a common case when reading a spreadsheet).
> On test_1 this eliminates getRowCount as a profiling hot spot; it barely
> registers, taking less than 1% of its original time.
> Table.getCellByPosition also calls getRowByIndex, which scans all rows
> up to the row at index. (A row element in the xml may have a repeat
> count, so the row at index i is not necessarily represented by the
> i'th row element.) If getCellByPosition is used for multiple cells
> in a row, this scan is currently repeated. It can partially be
> avoided by caching the rows found, so getting the same row again for
> another cell runs in O(1) time. On test_1 this reduces total time in
> getRowByIndex by around 40%, but unlike many spreadsheets test_1 does
> not fill many cells in one row, so impact may be lower than in other
> programs.
> The Simple API is a wrapper around the underlying ODF DOM (xml). The
> cached values can be decached when the underlying DOM changes by
> listening for MutationEvents. In particular, the rows and row count
> can be invalidated when rows are added or removed, either by adding or
> deleting row elements, or changing a number rows repeated attribute.
> (Simply clearing the cache is conservative. An alternate approach
> would be to change the cached row count by the number of rows removed,
> and change the size of the row cache by the number of rows removed.
> However, if many rows are being added or removed in sequence, then
> repeatedly adjusting the size and indexes of the cache may add more
> work than simply clearing it and recomputing it when reading begins
> again. So for now the simple approach is to simply clear the cache.)
> (Another option not taken is for getRowByIndex to cache all the rows
> the first time the rows are scanned to find a row. Then getRowByIndex
> would be more like getRowCount: O(_n_) the first time, and O(1)
> thereafter as long as no rows were added or removed. However, if rows
> *are* being added, then this worsens the O(_n_) cost for each row.
> Users who want this performance can already get it by using
> getRowIterator or getRowList rather than getRowIndex or
> getCellByPosition.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)