[ 
https://issues.apache.org/jira/browse/ODFTOOLKIT-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nimarukan updated ODFTOOLKIT-427:
---------------------------------
    Attachment: Issue-427-patches.zip

(patch -l -p 1 -i file.patch)

427-part1-simple-Table-cacheRowCountAndColCount.patch
427-part2-simple-Table-cacheRowByIndex.patch

(result should preserve Table.java's unix line ends '\n', not '\r\n')


> PERFORMANCE: Table getRowCount, getRowByIndex cache improves getCellByPosition
> ------------------------------------------------------------------------------
>
>                 Key: ODFTOOLKIT-427
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-427
>             Project: ODF Toolkit
>          Issue Type: Improvement
>          Components: simple api
>         Environment: odfdom-java-0.8.11-incubating-SNAPSHOT
> simple-odf-0.8.2-incubating-SNAPSHOT
> jdk 1.8.0_77, win7
>            Reporter: Nimarukan
>            Priority: Minor
>              Labels: performance
>         Attachments: Issue-427-patches.zip
>
>
> org.odftoolkit.simple.table
> Table.getRowCount
> is a top hotspot in a (NetBeans) sample profiling run of test_1 of
> https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
> Table.getCellByPosition is used to fill a cell in 5000 rows of this test.
> Table.getCellByPosition calls getRowCount, which scans all rows to count them.
> So filling one cell in _n_ rows with getCellByPosition runs in O(_n_^2^) time.
> This can be avoided by caching the rowCount, so getRowCount runs in
> O(_n_) time the first time and O(1) time thereafter as long as no rows
> are inserted or removed (a common case when reading a spreadsheet).
> On test_1 this eliminates getRowCount as a profiling hot spot; it barely
> registers, taking less than 1% of its original time.
> Table.getCellByPosition also calls getRowByIndex, which scans all rows
> up to the row at index.  (A row element in the xml may have a repeat
> count, so the row at index i is not necessarily represented by the
> i'th row element.)  If getCellByPosition is used for multiple cells
> in a row, this scan is currently repeated.  It can partially be
> avoided by caching the rows found, so getting the same row again for
> another cell runs in O(1) time.  On test_1 this reduces total time in
> getRowByIndex by around 40%, but unlike many spreadsheets test_1 does
> not fill many cells in one row, so impact may be lower than in other
> programs.
> The Simple API is a wrapper around the underlying ODF DOM (xml).  The
> cached values can be decached when the underlying DOM changes by
> listening for MutationEvents.  In particular, the rows and row count
> can be invalidated when rows are added or removed, either by adding or
> deleting row elements, or changing a number rows repeated attribute.
> (Simply clearing the cache is conservative.  An alternate approach
> would be to change the cached row count by the number of rows removed,
> and change the size of the row cache by the number of rows removed.
> However, if many rows are being added or removed in sequence, then
> repeatedly adjusting the size and indexes of the cache may add more
> work than simply clearing it and recomputing it when reading begins
> again.  So for now the simple approach is to simply clear the cache.)
> (Another option not taken is for getRowByIndex to cache all the rows
> the first time the rows are scanned to find a row.  Then getRowByIndex
> would be more like getRowCount: O(_n_) the first time, and O(1)
> thereafter as long as no rows were added or removed.  However, if rows
> *are* being added, then this worsens the O(_n_) cost for each row.
> Users who want this performance can already get it by using
> getRowIterator or getRowList rather than getRowIndex or
> getCellByPosition.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to