(posted to sc and performance lists; follow-ups should be on the
performance list)
Hi everyone,
For a while, there has been an idea for speeding up the saving of small
changes to large spreadsheet documents: generate XML elements only for
the changed cells.
I did a little experiment to find out how much time this can really
save. It's not a complete implementation, just a test to see what's
possible. This is what I changed:
- Loading of the file is unchanged. No additional information is kept in
memory after the file is loaded.
- What I keep in memory is the list of changes, as long as only
supported changes are made. For this test, only input of numbers into
cells that weren't empty before is supported. If any other change is
made, the list is discarded and the normal save code is used (a small
sketch of such a change list follows after this list). For simplicity,
it is assumed that the input doesn't change the cell's format. Changed
formula results are ignored (in a real implementation, they would have
to be tracked, too).
- When the file is saved and the content.xml stream is generated, the
old content.xml is first opened and parsed for the positions of the
affected cell elements (see below for details). Then the output is
generated by copying the unaffected parts and inserting a new, simple
cell element in place of each affected cell (also sketched below).
- The styles.xml stream is copied in unchanged form.
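
To make the second point concrete, here is a minimal sketch of how such
a change list could be kept and invalidated. This is not the actual CWS
code; all names here (CellPos, ChangeList and so on) are invented for
illustration:

#include <vector>

struct CellPos { int nTab; int nCol; int nRow; };

struct NumericInput
{
    CellPos aPos;       // which cell was changed
    double  fValue;     // its new numeric value
};

class ChangeList
{
    std::vector<NumericInput> aChanges;
    bool bValid = true;     // false -> fall back to a full save

public:
    // called from the supported input path only
    void AddNumericInput( const CellPos& rPos, double fValue )
    {
        if ( bValid )
            aChanges.push_back( { rPos, fValue } );
    }

    // called for every other kind of modification
    void Invalidate()
    {
        bValid = false;
        aChanges.clear();
    }

    bool CanSaveIncrementally() const { return bValid && !aChanges.empty(); }
    const std::vector<NumericInput>& GetChanges() const { return aChanges; }
};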
To find the affected cell elements in the XML stream quickly, expat's C
interface is used, searching for the qualified element names table:table,
table:table-row, table:table-cell and table:covered-table-cell. Because
this is just an optimization, I can assume that our own namespace
prefixes are used. Parsing is stopped as soon as the last affected cell
has been found.
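
For illustration, a stripped-down sketch of that scan, using expat calls
that really exist (XML_ParserCreate, XML_SetElementHandler,
XML_GetCurrentByteIndex, XML_GetCurrentByteCount, XML_StopParser). It
locates a single affected cell by counting rows and cells; a real
version would also have to handle table:table elements for multiple
sheets, the number-rows-repeated/number-columns-repeated attributes, and
a whole list of affected cells. The function and struct names are made
up for this example:

#include <expat.h>
#include <cstdio>
#include <cstring>

struct ScanState
{
    XML_Parser parser;
    int  nTargetRow, nTargetCol;         // 0-based position of the cell
    int  nCurRow = -1, nCurCol = -1;
    long nCellStart = -1, nCellEnd = -1; // byte range in the old stream
    bool bInTarget = false;
};

static bool lcl_IsCell( const XML_Char* pName )
{
    return std::strcmp( pName, "table:table-cell" ) == 0 ||
           std::strcmp( pName, "table:covered-table-cell" ) == 0;
}

static void XMLCALL lcl_Start( void* pData, const XML_Char* pName,
                               const XML_Char** /*pAtts*/ )
{
    ScanState* p = static_cast<ScanState*>( pData );
    if ( std::strcmp( pName, "table:table-row" ) == 0 )
    {
        ++p->nCurRow;
        p->nCurCol = -1;
    }
    else if ( lcl_IsCell( pName ) )
    {
        ++p->nCurCol;
        if ( p->nCurRow == p->nTargetRow && p->nCurCol == p->nTargetCol )
        {
            p->bInTarget = true;
            // byte offset of the '<' of this cell's start tag
            p->nCellStart = (long) XML_GetCurrentByteIndex( p->parser );
        }
    }
}

static void XMLCALL lcl_End( void* pData, const XML_Char* pName )
{
    ScanState* p = static_cast<ScanState*>( pData );
    if ( p->bInTarget && lcl_IsCell( pName ) )
    {
        // end of the end tag (for an empty "<.../>" element, both
        // handlers see the same token)
        p->nCellEnd = (long) ( XML_GetCurrentByteIndex( p->parser )
                               + XML_GetCurrentByteCount( p->parser ) );
        p->bInTarget = false;
        XML_StopParser( p->parser, XML_FALSE );  // last cell found, stop
    }
}

bool ScanOldContent( std::FILE* pIn, ScanState& rState )
{
    rState.parser = XML_ParserCreate( nullptr );  // no namespace processing:
                                                  // we rely on our own prefixes
    XML_SetUserData( rState.parser, &rState );
    XML_SetElementHandler( rState.parser, lcl_Start, lcl_End );

    char aBuf[8192];
    size_t nRead;
    while ( ( nRead = std::fread( aBuf, 1, sizeof(aBuf), pIn ) ) > 0 )
        if ( XML_Parse( rState.parser, aBuf, (int)nRead, 0 ) != XML_STATUS_OK )
            break;      // parse error, or stopped by XML_StopParser above
    XML_ParserFree( rState.parser );
    return rState.nCellEnd >= 0;
}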
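
With those byte ranges known, the output generation described above
becomes plain copying and splicing. Again a sketch with invented names
(Replacement, SpliceContent); the new cell element string is assumed to
come from the normal export code, and the old stream is only read
forward, so it can be decompressed on the fly:

#include <cstdio>
#include <string>
#include <vector>

struct Replacement
{
    long nStart;            // byte offset of the old cell's start tag
    long nEnd;              // byte offset just past the old cell's end tag
    std::string aNewXml;    // new cell element from the normal export code
};

// copies nCount bytes from pIn to pOut; with pOut == nullptr, the bytes
// are read and discarded (used to skip the old cell element)
static void lcl_CopyBytes( std::FILE* pIn, std::FILE* pOut, long nCount )
{
    char aBuf[8192];
    while ( nCount > 0 )
    {
        size_t nWant = nCount < (long)sizeof(aBuf) ? (size_t)nCount
                                                   : sizeof(aBuf);
        size_t nRead = std::fread( aBuf, 1, nWant, pIn );
        if ( nRead == 0 )
            break;                      // unexpected end of stream
        if ( pOut )
            std::fwrite( aBuf, 1, nRead, pOut );
        nCount -= (long)nRead;
    }
}

// rRepl must be sorted by nStart and non-overlapping;
// nOldSize is the uncompressed size of the old content.xml
void SpliceContent( std::FILE* pIn, std::FILE* pOut, long nOldSize,
                    const std::vector<Replacement>& rRepl )
{
    long nPos = 0;
    for ( const Replacement& r : rRepl )
    {
        lcl_CopyBytes( pIn, pOut, r.nStart - nPos );        // unaffected part
        lcl_CopyBytes( pIn, nullptr, r.nEnd - r.nStart );   // skip old cell
        std::fwrite( r.aNewXml.data(), 1, r.aNewXml.size(), pOut );
        nPos = r.nEnd;
    }
    lcl_CopyBytes( pIn, pOut, nOldSize - nPos );            // remaining tail
}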
For testing, I used a large, simple file with 500000 cells, containing
only text and numbers.
In CPU time (measured with callgrind), normal saving takes 20.1 billion
cycles. Incremental saving with three values changed at the top of the
file takes 6.3 billion, and with three values changed at the bottom it
takes 11.4 billion cycles. Note that a real implementation would
probably need some additional checks that might increase the time again.
In exchange for these saved CPU cycles, there is a bit of additional
file access to read the old streams again. This isn't much, because they
can be read from the compressed file. Reliably measuring the total
elapsed time isn't easy, but quick tests with a larger file show the
time for incremental saving to be around 50% of a normal save for
changes at the top of the file, and 70% for changes at the bottom. On
other machines, this might be totally different.
What next?
As mentioned above, the immediate goal of this experiment was only to
get numbers on what is possible. Before making a real implementation of
this, we should see how far we can get by improving normal saving,
because that would benefit all usages, not only specific, limited
modifications.
For reference, the code of this experiment is in the CWS "calcincsave".
I also put this text into the wiki, at
http://wiki.services.openoffice.org/wiki/Calc/Performance/Incremental_Saving.
Niklas