On Fri, 2006-09-08 at 12:44 +0930, Michael Zucchi wrote:

> Hmm, will it really reduce memory usage though? I'm not so sure.
> Remember that although the summary file contains all strings on disk,
> in memory they are actually unique-ized. For a lot of load-types
> (e.g. mailing lists), this saves a tremendous amount of memory - even
> after all the overhead of the hashtable to manage it. Address and
> subject strings in particular. And an mmap'd file isn't terribly
> different in performance to malloc'd memory - once the latter is set
> up, in both cases the kernel is the one loading it off disk (swap,
> or filesystem).
With the offsets in a separate index mmap I can do the same trick, of
course: let the offsets of two fields be equal, i.e. let both point at
one uniquified copy of the string (for example the first occurrence).
A concrete sketch of this follows below. It will indeed be tricky to
support this when writing out the summary and index, but not
impossible. Hard is fun :). If it weren't hard, it wouldn't be a
challenge, and lots of our community members wouldn't be interested in
writing it ;-) I am. I do of course have priorities for tinymail and
other projects, but this being one of those funky ideas that I can't
get rid of in my head sorts it higher on my priority list (else I go
nuts).

> (an mmap file IS just memory, so if "mmap'd file size + in-memory
> support tables" > "in-memory version", then you haven't generally
> won - even if 'ps' shows you have).

I agree, with one caveat: an mmap can (and often will) be swapped in
and out by the kernel, which in general knows better than the
application developer possibly can when it's a good idea to page parts
of a mapping in or out. My opinion is that this makes it a
particularly interesting memory technique on mobile devices. Mobile
device users often don't care a lot that switching between
applications takes a few milliseconds before the recently switched-to
application becomes usable again (the swap-in happened).

The mapping can also be shared if multiple E-mail clients work with
the same summary files. I agree this is uncommon, but on large
installations -- e.g. servers that run a lot of instances of the
E-mail client with a lot of shared folders -- this might be a
significant improvement. An application like Evolution can probably be
adapted to figure out about shared folders. Or an admin can be taught
to set symbolic links to the cache folders of shared accounts, as a
work-around that can be used already (with some inotify magic that
reloads the summaries of folders whose summary file got altered by
another process: for example, respond with an mremap in the notified
process).

> Note also that all of the summary items, and all of the strings thus
> stored, are using special allocators which reduce - significantly -
> the overhead of malloc on these small allocations. In earlier
> versions I tried a different string allocator for each messageinfo
> which mimics much of what you're doing here, but in memory. Strings
> were stored in an array structure, but I only stored a single
> pointer to the base of the array, and packed the strings into a
> NUL-separated right-sized buffer, and searched each time I needed to
> access them; overhead was only about 5 bytes (pointer to the 'array'
> plus an array 'limit'). Performance was fine, but memory usage was
> significantly larger than the current model, where strings are
> 'unique-ized' - like 30% or more iirc.

True. And we can mimic this behaviour with mmaps too. I do agree it
would be complex, especially in the current Camel design. That's why
we need people like you, Michael, to help us develop the disk summary
concept ;-)

My opinion is that with a good design, all complex programming
concepts become simple, understandable, maintainable and doable. KISS
is too often invoked by people who have no f. idea what software
development is about. KISS is not "doing everything in Python" or
"never type more than 5 lines per algorithm". I'm a supporter of
splitting complex things up into as many simple pieces as needed to
make each of the pieces understandable.
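To make that shared-offset trick concrete, here is a minimal sketch in
C. None of this is existing Camel code; the function names and the
four-field record layout are invented for illustration. One mmap holds
the NUL-terminated strings (the pool), the other holds fixed-width
records of uint32_t offsets into that pool. Uniquifying then simply
means reusing an offset that was already handed out, so two fields can
point at the same pool entry:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <glib.h>

#define FIELDS_PER_RECORD 4     /* from, subject, to, flags */

/* Reader side: resolve field 'field' of record 'n' against the pool.
 * 'index' is the mmap'd index file, 'pool' the mmap'd string file. */
static const char *
summary_field (const unsigned char *pool, const uint32_t *index,
               size_t n, size_t field)
{
        return (const char *) pool + index[n * FIELDS_PER_RECORD + field];
}

/* Writer side: append a string to the pool file, unless we already
 * wrote it, in which case the old offset is reused - the uniquifying
 * happens at summary-write time instead of at load time. */
static uint32_t
pool_intern (GHashTable *seen, FILE *pool_file, const char *str)
{
        gpointer off;
        uint32_t new_off;

        if (g_hash_table_lookup_extended (seen, str, NULL, &off))
                return GPOINTER_TO_UINT (off);

        new_off = (uint32_t) ftell (pool_file);
        fwrite (str, 1, strlen (str) + 1, pool_file);  /* include the NUL */
        g_hash_table_insert (seen, g_strdup (str),
                             GUINT_TO_POINTER (new_off));
        return new_off;
}

The 'seen' hashtable (created with g_hash_table_new (g_str_hash,
g_str_equal)) only lives while the summary is being written; once both
files are on disk and mmap'd, a lookup costs no heap at all.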
Descartes, by the way, was the dude that taught us that splitting-up
principle (other than that, I'm much more into Socrates). My point of
this dialog: "too complex" doesn't exist. We've put a few dudes on the
moon and built technology that splits atoms. We, humans, are designed
to overcome complexity.

> Having extra indices on disk - you're really just writing a database
> table - and all of the associated complexity, consistency and
> coherency issues that implies. And unless you do something 'proper'
> like a btree with a unique key, you're going to have to end up
> storing plenty of extra support stuff in memory anyway, e.g. a
> hashtable to find items by uid, the sorted list of messages for the
> view, and so on.

Finding an item would be done by calculating its offset in the index,
which can be recomputed each time access to one of its members is
needed. I wouldn't look items up by uid but by an "nth" integer. The
uid, being fixed-size, can be one of the members of the index record
itself, which makes it possible to search for a uid in this index
(that should be rather fast; in an mmap it's probably about as fast as
searching a typical hashtable). A sketch of such a record layout
follows further below.

> We did a lot to reduce per-message overhead at the camel level, but
> it's an area of diminishing returns - the ui table view will still
> take as much memory, etc.

Aha, but that can also be reduced if you develop a model that disposes
the rows that become invisible. With fixed row heights and widths and
an implemented row_unref in a custom GtkTreeModel, you can already do
something like this with GtkTreeView. I'm sure ETable and its model
type can be adapted (or already are) to support something like this
too. Tinymail is already suitable for, and tested with, this (using
GtkTreeView).

I tested this principle with the Camel POP provider, which doesn't
have support for summaries. What I did was load a CamelMimeMessage
instance for each row that became visible. It worked, yet each visible
row consumed a CamelMimeMessage: an enormous amount of memory compared
to a summary item. What didn't work was, of course, sorting, because
instantiating CamelMimeMessage instances for every to-sort item took
days. But the CamelMimeMessage instances of invisible rows were never
allocated, only the visible ones, which kept memory consumption of
tinymail below 10M for a 15,000-message POP folder (measured with
valgrind). This wasn't using the mmap concept, which in principle does
more or less the same with the summary data (but then using 4k pages
in the kernel). Viewing and scrolling, using this implementation, was
already fast enough on a mobile device like the Nokia 770. I'm not
going to call it "snappy" fast, just "fast enough" or "acceptably
fast". I am now, like Evolution, going to use the mbox provider for
this in tinymail; the support for POP without a summary was indeed a
proof of concept.

Thanks a lot for your point of view, Michael. I value it.
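To illustrate the fixed-size uid idea: a hedged sketch, again with an
invented record layout (the field widths are assumptions, not anything
Camel currently does). Record n is found by pure pointer arithmetic,
and a uid search is a scan over the mmap'd records, which the kernel
pages in on demand:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>

#define UID_SIZE 16     /* assumed fixed-width uid, NUL-padded */

struct index_record {
        char     uid[UID_SIZE];
        uint32_t from_offset;    /* offsets into the string pool */
        uint32_t subject_offset;
        uint32_t to_offset;
        uint32_t flags;
};

/* The "nth" lookup: no hashtable, just arithmetic on the mapping. */
static const struct index_record *
record_nth (const void *istart, size_t n)
{
        return (const struct index_record *) istart + n;
}

/* Searching by uid: a linear scan over the mapped records. */
static ssize_t
record_find_uid (const void *istart, size_t count, const char *uid)
{
        const struct index_record *rec = istart;
        size_t i;

        for (i = 0; i < count; i++)
                if (strncmp (rec[i].uid, uid, UID_SIZE) == 0)
                        return (ssize_t) i;
        return -1;
}

If the records were written out sorted by uid, that scan could become
a bsearch(3). A real on-disk format would also have to pin down the
padding and endianness of the struct, which this sketch ignores.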
On 08/09/06, Philip Van Hoof <[EMAIL PROTECTED]> wrote:

> On Thu, 2006-09-07 at 21:14 +0200, Philip Van Hoof wrote:
>
> > To read message n, you would simply do something like:
> >
> > from    = sstart + *(istart + (sizeof (int) * 4 * n) + 1)
> > subject = sstart + *(istart + (sizeof (int) * 4 * n) + 2)
> > to      = sstart + *(istart + (sizeof (int) * 4 * n) + 3)
> > flags   = sstart + *(istart + (sizeof (int) * 4 * n) + 4)
>
> No no no no no.
>
> I meant (something like) this of course:
>
> unsigned char *sstart = (unsigned char *) mmap (..);
> unsigned char *istart = (unsigned char *) mmap (..);
>
> #define AMOUNT_OF_OFFSETS_PER_RECORD 4
> #define AOO AMOUNT_OF_OFFSETS_PER_RECORD
>
> from    = sstart + *(uint32_t *) (istart + sizeof (uint32_t) * (AOO * n + 0));
> subject = sstart + *(uint32_t *) (istart + sizeof (uint32_t) * (AOO * n + 1));
> to      = sstart + *(uint32_t *) (istart + sizeof (uint32_t) * (AOO * n + 2));
> flags   = sstart + *(uint32_t *) (istart + sizeof (uint32_t) * (AOO * n + 3));
>
> Too many pointers in my head ;)
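For completeness, here is roughly how those elided mmap (..) calls
could look. This is a sketch only: error handling is kept minimal, and
the read-only, shared flags are chosen to match the
sharing-between-processes idea earlier in this mail.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only; MAP_SHARED lets several processes that
 * map the same summary file share the physical pages. */
static void *
map_whole_file (const char *path, size_t *len_out)
{
        struct stat st;
        void *p;
        int fd;

        fd = open (path, O_RDONLY);
        if (fd < 0)
                return NULL;
        if (fstat (fd, &st) < 0) {
                close (fd);
                return NULL;
        }
        p = mmap (NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close (fd);     /* the mapping outlives the descriptor */
        if (p == MAP_FAILED)
                return NULL;

        *len_out = (size_t) st.st_size;
        return p;
}

sstart and istart would then just be two calls like
map_whole_file ("summary.data", &slen) and
map_whole_file ("summary.idx", &ilen); the file names here are made
up.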
-- 
Philip Van Hoof, software developer at x-tend
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
work: vanhoof at x-tend dot be
http://www.pvanhoof.be - http://www.x-tend.be

_______________________________________________
Performance-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/performance-list