Hi Kay,
On Tue, Sep 13, 2005 at 17:38:52 +0200, Kay Ramme - Sun Germany - Hamburg wrote:
> investigating especially into construction&destruction and suggesting a
> slightly modified variant, with improved behavior. So, if you have some
> time left, I would like to get your feedback ;-).
Thank you for your rich and detailed analysis!
I fully second your conclusion about the heap-string-with-a-buffer. For
different scenarios it could even be an advantage, in time and space, to
have string types with different buffer sizes, e.g. 8, 16, 32, 64.
I would like to add an additional view for the scenario of loading
a spreadsheet, which may also apply to loading other non-spreadsheet
OpenDocument files. I didn't verify though if it holds true in its
entirety nor do I know if it would be feasible to implement it in
a timely manner.
<table:table-cell office:value-type="float" office:value="10">
<text:p>10</text:p>
</table:table-cell>
As you stated, the steps involved for parsing value cells are
1. the XML file gets parsed, about 9 strings (for every tag and value)
get created for every cell,
2. the strings are passed via callbacks to the Calc engine,
3. the Calc engine converts the content string into a number
4. the strings get destructed.
The following values are known in advance and do not change:
- length("table:table-cell") = 16c
- length("office:value-type") = 17c
- length("office:value") = 12c
- length("text:p") = 6c
Here we could go one step further. It could be possible for the parser
to preallocate sets of strings for different usage scenarios, here cell
of value type float with value and textual representation, and store
them in a map, e.g. "table:table-cell", "office:value-type", "float", or
even a combined "office:value-type=\"float\"", "office:value", "text:p".
The map would have to be accessible by unique IDs that are generated
from the sequences of characters using a collision-free algorithm. While
parsing the stream, IDs would be generated on the fly (hence
collision-free to not make it necessary to pass strings around for
comparison like it is necessary in a general hash map) and matches
against the map could then return references to a shared sequence of
a string. This would eliminate all allocations and deallocations of
steps #1 and #4 above where fixed strings that are known in advance are
involved, thus reducing concurrency of your example from 3 to 1 (the
value "10").
Using your numbers and replacing the average string length of 12.75 with
8 (assumption of medium value string length) and adjusting the
probability of a value string matching the buffer length it would be
- buffer_size == 16c
- probability(sequence >= buffer) == 5%
- probability(sequence < buffer) == 95%
construction_time := 400000 * (0.05 * (519insts + 35insts * 8) + 0.95 * 35insts
* 8)
== 122380000 insts
(yours: 846900000 insts)
destruction_time := 400000 * 0.05 * 465insts
== 9300000 insts
(yours: 279000000 insts)
an overall gain of (122380000 + 9300000 == 131680000) compared to the
previous (846900000 + 279000000 == 1125900000) resulting in factor 0.11
of the previously needed time of your example, compared to the current
shared-heap-string (1716300000 insts) it is even factor 0.07 and not
"only" 0.65.
Now the question remains whether there is an algorithm that generates
2*400000 unique IDs and references to shared sequences in less than
(1125900000 - 131680000 == 994220000) instructions (~ 1242 insts per
string) and whether UDK would like to implement such a map :-)
Note: all of this is just a quick idea resulting of a brainstorm,
nothing seriously investigated, I didn't even verify the numbers..
Eike
--
OOo/SO Calc core developer. Number formatter bedevilled I18N transpositionizer.
GnuPG key 0x293C05FD: 997A 4C60 CE41 0149 0DB3 9E96 2F1A D073 293C 05FD
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]