Hi,

I would like some input on what would be a good design to allow string de-duplication in all geodata in the working memory.

MOTIVATION

One major reason for Marble's high memory usage is that data loaded from files contains lots of repeated strings, and each of them is also stored as a string in the working memory, with its own separate complete copy. *1

FIRST EXPERIMENT

I did a quick hack to CacheRunner::parseFile() to try de-duplication at least per loaded cache file. For themes that use big cache files, like bluemarble or openstreetmap, I could see a reduction of > 3 MB :) without a noticeable price on load time. *2

That seemed good enough as an intermediate improvement to go in as a commit already, so I pushed it as 690fcf380985c5fefbb8531fbeb54a1432b49044
(https://quickgit.kde.org/?p=marble.git&a=commitdiff&h=690fcf380985c5fe)

In that commit you can also see some similar changes to OSMParser (which accidentally slipped in). They show that per-file string de-duplication with lots of smaller files yields only small improvements (< 1 MB seen with hello-world example and vectorosm theme on plain app start), so it is not the silver bullet.

CHALLENGE

Perfect solution: no string would be duplicated among all the geodata objects (there might be even more candidates: QDateTime is also implicitly shared, at least in Qt5, and might have lots of duplicates).

Which means:
* on adding new geodata to the working memory, e.g. on loading from disk or the network, any string should be checked for an already existing duplicate, and if there is one, the original should be used.
* on changing some string of some geodata, the new string should be checked against the existing strings and an existing one be used instead, if present; the old string should be checked whether anyone else is still using it, and be removed if not.
* on removing some string of some geodata, the old string should be checked whether anyone else is still using it, and be removed if not.

APPROACH?

Has there ever been any discussion about that? Any previous approaches?

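To make the first rule from the CHALLENGE list concrete (on adding data, reuse an existing copy if there is one), here is a minimal sketch of a per-load string pool. It is only an illustration in plain standard C++, not Marble code and not what the commit does: std::shared_ptr<const std::string> stands in for the implicitly shared QString, and the names SharedString, ByValue and StringPool are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>

// Stand-in for an implicitly shared string like QString (assumption):
// copying the handle shares one character buffer instead of duplicating it.
using SharedString = std::shared_ptr<const std::string>;

// Orders pool entries by string value and allows lookup by a plain
// std::string, so a lookup does not have to allocate a handle first.
struct ByValue {
    using is_transparent = void;
    bool operator()(const SharedString &a, const SharedString &b) const { return *a < *b; }
    bool operator()(const SharedString &a, const std::string &b) const { return *a < b; }
    bool operator()(const std::string &a, const SharedString &b) const { return a < *b; }
};

// Hypothetical helper: hands out one shared instance per distinct value.
class StringPool
{
public:
    SharedString intern(const std::string &value)
    {
        const auto it = m_pool.find(value);
        if (it != m_pool.end()) {
            return *it; // duplicate found: reuse the existing buffer
        }
        const auto result = m_pool.insert(std::make_shared<const std::string>(value));
        assert(result.second); // find() above guarantees the value is new
        return *result.first;
    }

    // Number of distinct strings currently held by the pool.
    std::size_t size() const { return m_pool.size(); }

private:
    std::set<SharedString, ByValue> m_pool;
};
```

A parser would create one StringPool per load, pass every string it reads through intern(), and drop the pool when the load is done. The handed-out strings stay valid after that, because each handle keeps its buffer alive on its own.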
The part of removing no longer used strings from the working memory would be solved automatically by QString. But what to do about finding existing instances of a given QString? That would need some kind of lookup facility, like the QSet<QString> used in the commit for CacheRunner.

Keeping such a global QSet<QString> around all the time has 2 disadvantages:
* each QString object in the set would block the automatic removal of no longer used strings, as the entry still references it (and there seem to be no hooks in QString to help with this)
* steady memory footprint of the table (cost unknown)

Creating the table on the fly when loading a new file or on a change has the disadvantage of a runtime price (cost unknown).

There could also be something like a garbage^Wde-duplication collection which is run some time after a file has been loaded or some changes have been made.

Or a new class MString could be created which has that de-duplication somehow built in?

YOUR INPUT!

De-duplicating data so far seems a worthwhile thing to investigate further, to help with the unfortunate memory usage of Marble. Being a newbie with Marble(-like) data structures, I am curious what the veterans and everyone else have to say. Please do :)

*1) A similar problem also exists with QString objects created on the fly from C++ raw strings, e.g. in a loop or a repeated function call. That is why I recently started to add QStringLiteral wrappers to more and more such raw strings; this turns the raw strings at compile time into a format that is then shared by the created QString objects. Or QLatin1String in other places, where a QString instance can be avoided through method overloads in the used API.

*2) Seen e.g. with
valgrind --tool=massif build/marble/examples/cpp/hello-marble/hello-marble
where the memory allocated by QArrayData::allocate(...) is reduced.

Cheers
Friedrich
