Reducing working memory footprint by string deduplication of geodata

Friedrich W. H. Kossebau Wed, 31 Aug 2016 12:12:57 -0700

Hi,

I would like some input on what would be a good design to allow string 
de-duplication across all geodata in the working memory.



MOTIVATION

One major reason for Marble's memory usage is that data loaded from files 
contains lots of repeated strings, which are also stored as strings in the 
working memory. Each occurrence still gets its own separate complete copy. *1


FIRST EXPERIMENT

I did a quick hack to CacheRunner::parseFile() to try de-duplication at least 
per loaded cache file. For themes that use big cache files, like bluemarble or 
openstreetmap, I could see a reduction of > 3 MB :) without a noticeable 
price in load time. *2

That seemed good enough to go in already as an intermediate improvement, so I 
pushed it as commit 690fcf380985c5fefbb8531fbeb54a1432b49044
(https://quickgit.kde.org/?p=marble.git&a=commitdiff&h=690fcf380985c5fe)

In that commit you can also see some similar changes to OSMParser (which 
accidentally slipped in). They show that per-file string de-duplication 
yields only a small improvement with lots of smaller files (< 1 MB seen with 
the hello-world example and the vectorosm theme on plain app start), so it is 
not the silver bullet.
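
A rough sketch of the per-file idea, not the code from the commit. It uses 
standard C++ with std::shared_ptr<const std::string> as a stand-in for the 
implicitly shared QString; the names SharedString, FilePool and intern() are 
made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for an implicitly shared string such as QString (assumption:
// copying the handle shares the character data instead of duplicating it).
using SharedString = std::shared_ptr<const std::string>;

// Hypothetical pool that lives only while one file is being parsed.
class FilePool {
public:
    // Returns the pooled instance equal to `text`, adding it on first sight.
    SharedString intern(const std::string& text) {
        auto it = m_pool.find(text);
        if (it != m_pool.end())
            return it->second;  // duplicate: reuse the existing data
        auto shared = std::make_shared<const std::string>(text);
        m_pool.emplace(text, shared);
        return shared;
    }

    // Number of distinct strings seen so far.
    std::size_t size() const { return m_pool.size(); }

private:
    std::unordered_map<std::string, SharedString> m_pool;
};
```

Since the pool is dropped once the file is parsed, strings repeated across 
many small files are not caught, which matches the small gain seen with 
vectorosm.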


CHALLENGE

Perfect solution: no string would be duplicated across all the geodata 
objects (there might be even more candidates: QDateTime is also implicitly 
shared, at least in Qt5, and might have lots of duplicates).

Which means:
* on adding new geodata to the working memory, e.g. on loading from disk or 
the network, every string should be checked for an existing duplicate, and if 
there is one, the existing instance should be used.
* on changing some string of some geodata, the new string should be checked 
against the existing strings and an existing instance be used instead, if 
present. The old string should be checked for whether anyone else is still 
using it, and be removed if not.
* on removing some string of some geodata, the old string should be checked 
for whether anyone else is still using it, and be removed if not.
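
The three cases above could look roughly like this. It is a single-threaded 
sketch in standard C++, with std::shared_ptr<const std::string> standing in 
for QString; StringPool and its acquire()/change()/release() methods are 
hypothetical names, not existing Marble or Qt API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for an implicitly shared string such as QString.
using SharedString = std::shared_ptr<const std::string>;

// Hypothetical global pool holding one strong reference per distinct string.
class StringPool {
public:
    // Adding: reuse the existing instance if there is one.
    SharedString acquire(const std::string& text) {
        auto it = m_pool.find(text);
        if (it == m_pool.end())
            it = m_pool.emplace(text,
                                std::make_shared<const std::string>(text)).first;
        return it->second;
    }

    // Removing: the caller gives up its reference; the entry is dropped if
    // only the pool itself would still reference the string.
    void release(SharedString& ref) {
        if (!ref)
            return;
        const std::string key = *ref;
        ref.reset();
        auto it = m_pool.find(key);
        if (it != m_pool.end() && it->second.use_count() == 1)
            m_pool.erase(it);
    }

    // Changing: acquire the new value first, then release the old one.
    void change(SharedString& ref, const std::string& newText) {
        SharedString replacement = acquire(newText);
        release(ref);
        ref = std::move(replacement);
    }

    std::size_t size() const { return m_pool.size(); }

private:
    std::unordered_map<std::string, SharedString> m_pool;
};
```

The explicit release() call is the price of this variant: every place that 
drops or overwrites a string has to go through the pool.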


APPROACH?

Has there ever been any discussion about that? Any previous approaches?

The part of removing no longer used strings from the working memory would be 
automatically solved by QString. But what to do about finding existing 
instances of a given QString?
Some kind of lookup facility would be needed, like the QSet<QString> used in 
the commit for CacheRunner.
Keeping such a global QSet<QString> around all the time has 2 disadvantages:
* each QString object in the set would block the automatic removal of no 
longer used strings, as the entry still references it (and there seem to be no 
hooks in QString to help with this)
* steady memory footprint of the table (cost unknown)
Creating the table on the fly when loading a new file or on a change has the 
disadvantage of a runtime price (cost unknown).
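
To illustrate the first disadvantage and what a "hook" could buy: with a 
string type that supports weak references, the table can observe strings 
without keeping them alive. The sketch below assumes such a type and uses 
std::shared_ptr/std::weak_ptr from standard C++ as a stand-in; as far as I 
can see QString offers nothing equivalent, and WeakStringPool is a 
hypothetical name.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for an implicitly shared string such as QString.
using SharedString = std::shared_ptr<const std::string>;

// Hypothetical pool that only observes strings and never keeps them alive.
class WeakStringPool {
public:
    SharedString intern(const std::string& text) {
        auto it = m_pool.find(text);
        if (it != m_pool.end()) {
            if (SharedString alive = it->second.lock())
                return alive;  // still used somewhere: share it
            m_pool.erase(it);  // stale entry: the string was already freed
        }
        auto shared = std::make_shared<const std::string>(text);
        m_pool.emplace(text, shared);
        return shared;
    }

    // True if some object outside the pool still holds the string.
    bool isAlive(const std::string& text) const {
        auto it = m_pool.find(text);
        return it != m_pool.end() && !it->second.expired();
    }

private:
    // The key is a plain copy of the text, so this sketch shows the lifetime
    // behaviour only; the table itself still costs memory.
    std::unordered_map<std::string, std::weak_ptr<const std::string>> m_pool;
};
```

Stale entries stay in the table until the same text is interned again, so 
the second disadvantage (footprint of the table) is not addressed by this.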

There could also be something like a garbage^Wde-duplication collection which 
is run some time after a file has been loaded or some changes have been 
made.
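
Such a deferred pass might look roughly like this, assuming there is some way 
to visit every string handle held by the geodata. Again standard C++ with 
std::shared_ptr<const std::string> standing in for QString; deduplicatePass() 
and ByContent are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Stand-in for an implicitly shared string such as QString.
using SharedString = std::shared_ptr<const std::string>;

// Orders handles by the text they point to, not by address.
struct ByContent {
    bool operator()(const SharedString& a, const SharedString& b) const {
        return *a < *b;
    }
};

// Redirects every handle to the first instance seen with equal content.
// The lookup table exists only for the duration of the pass.
// Returns how many handles were redirected.
std::size_t deduplicatePass(const std::vector<SharedString*>& handles) {
    std::set<SharedString, ByContent> seen;
    std::size_t redirected = 0;
    for (SharedString* handle : handles) {
        if (!handle || !*handle)
            continue;  // skip missing or empty handles
        auto result = seen.insert(*handle);
        if (!result.second && result.first->get() != handle->get()) {
            *handle = *result.first;  // the duplicate copy is freed here
            ++redirected;
        }
    }
    return redirected;
}
```

No table stays in memory between passes, at the price of walking all strings 
each time the pass runs.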

Or a new class MString could be created which has that de-duplication somehow 
built in?


YOUR INPUT!

De-duplicating data so far seems a worthwhile thing to investigate further, 
to help with the unfortunate memory usage of Marble. Being a newbie with 
Marble(-like) data structures, I am curious what the veterans and everyone 
else have to say. Please do :)


*1)
A similar problem also exists with QString objects created on the fly from C++ 
raw strings, e.g. in a loop or a repeatedly called function. That is why I 
recently started to add QStringLiteral wrappers to more and more such raw 
strings, which turns the raw strings at compile time into a format that is 
then shared by the created QString objects. Or QLatin1String in other places, 
where a QString instance can be avoided thanks to method overloads in the 
used API.

*2)
Seen e.g. with
        valgrind --tool=massif build/marble/examples/cpp/hello-marble/hello-marble
where the memory allocated by QArrayData::allocate(...) is reduced.

Cheers
Friedrich
