Re: [Evolution-hackers] mmap() for the summary file
Hi, [EMAIL PROTECTED] wrote: H... I am not really convinced yet; I mean even when I have 100 mails with the same email address in my Inbox (which seems a lot from one person!). The email-address would be 20 chars; now having 'm all in one string saves me, even in this extreme case, only about 2Kb, event forgetting about the overhead of the hash tables etc. as long as we talk about individual senders this is true, but if you think about mailinglists and all this automated server notifications this could save quite some memory. As long as it is possible to implement this without a too big effort and code overhead it would be useful. Of course - even saving 100kb which would be optimistic for a huge mailbox is not be worth to mention on a modern PC - but on a mobile device that would be nice to have. Greetings Florian -- The dream of yesterday Florian Boor is the hope of todayTel: +49 271-771091-14 and the reality of tomorrow.Fax: +49 271-771091-19 [Robert Hutchings Goddard, 1904][EMAIL PROTECTED] 1D78 2D4D 6C53 1CA4 5588 D07B A8E7 940C 25B7 9A76 ___ Evolution-hackers mailing list Evolution-hackers@gnome.org http://mail.gnome.org/mailman/listinfo/evolution-hackers
Re: [Evolution-hackers] mmap() for the summary file
On Mon, 2006-06-19 at 11:56 +0200, Florian Boor wrote: as long as we talk about individual senders this is true, but if you think about mailinglists and all this automated server notifications this could save quite some memory. As long as it is possible to implement this without a too big effort and code overhead it would be useful. Of course - even saving 100kb which would be optimistic for a huge mailbox is not be worth to mention on a modern PC - but on a mobile device that would be nice to have. But the complexity for getting this *in* the file that is to be mmap'ed does increase then. Also the backward compatibility is undoable this way. You'll need to implement some sort of hash-table with pointers to strings at the beginning of the file, hash-keys in the summary tree itself. And the strings at the end of the file. Agreed that this would dramatically increase loading time (the time needed to seek/walk-a-pointer over the entire mmap()ed file and create summary-info structs). As it would decrease the amount of data that is to be mmap()ed. I agree this is worthwhile for the current read() to memory implementation. I don't think the complexity is going to payoff in real improvements for a mmap() implementation of it. But I would, nevertheless, be very interested in the experiment. Note that there's an mmap() on Windows and that glib abstracts it with some glib API. I would of course use the glib abstraction API for this. -- Philip Van Hoof, software developer at x-tend home: me at pvanhoof dot be gnome: pvanhoof at gnome dot org work: vanhoof at x-tend dot be http://www.pvanhoof.be - http://www.x-tend.be ___ Evolution-hackers mailing list Evolution-hackers@gnome.org http://mail.gnome.org/mailman/listinfo/evolution-hackers
[Evolution-hackers] mmap() for the summary file
Hi there, I've been trying to replace the fread()/fopen() implementation in camel-folder-summary.c with an mmap() one. I know camel-file-utils.c will put duplicate strings in a hashtable and that way reduce memory usage for the summary information. Because a lot mail boxes have duplicate strings for the From and To headers. I know why and how this is implemented. And I understand that this already reduces memory usage a lot. However. On a small device with few memory resources, the kernel knows better when to allocate and when to deallocate uncachable data like this summary information. Therefore I propose to replace the implementation with mmap(). Not only I propose it, I also already tried it myself. While trying this, I came to the conclusion that it *would* be possible if the strings would have been terminated by '\0' in stead of being stored pascal-like using an encoded unsigned 32 bit integer in front of the string data. That decision makes this (using the current file format) impossible, unless the mmap'd memory (and therefore also the file on disk) is constantly rewritten (with '\0' characters) or unless the entire infrastructure that uses the summary strings is adapted to use this length information rather than using the strings directly from the mmap'ed memory *as* NULL terminated C strings (char arrays with a NULL termination). The second solution implies that all would have to be converted to GString's. I think it would reduce memory usage of Evolution with ~40mb (depending on the total amount of summary information being loaded). It would make the sorting of the header summary view a little bit slower on certain machines (mainly on machines that have very few memory resources left, so that the kernel will not put a lot of this mmap'ed data in its buffers/cache). The file format should be adapted in two ways: - Duplicate strings will need to be stored at only one location *on the disk*. So the hashtable implementation wouldn't be a memory-only but also a in-the-summary-file something. For example: A string-field can be a pointer to the first character of the string, or a pointer to another location in the file (in the mmap). - Strings will need to be '\0' terminated *in the file* so that they are directly usable from the mmap() memory block. Who are the brave souls that want to join me with this brain-damaging idea? And would a change like this (which would mean that a migration procedure each time an old folder-summary is loaded would need to run) ever get upstream? I measured (using valgrind) that most of the Evolution memory usages goes to storing a in-memory version of the summary files. I also measured that there's quite a lot memory segmentation going on (while loading the summary file) and that it (the memory for the file) consumes ~ twice as much as the on-filesystem filesize of the summary file. Loading using mmap() would be faster and wouldn't consume as much real memory (it would consume a mmap, that is true, and that memory would most likely go to the buffers/cache which the kernel manages, that is also true). Sorting might become a little bit slower (but probably not noticeable on most desktop hardware). I'm being serious, I would like to waste my time on this one. If the camel team of Evolution likes the idea (and wouldn't mind wasting some of their time on it as well). If not ... I'd rather wait for the disk-summary branch or for libspruce than to waste my time with it. Because forking camel would mean wasting huge amounts of time on maintaining a fork. I attached a patch with my current tryout. I already load the header of the summary file using mmap. That is already working. The difficult part is, however, making the strings themselves usable. Because those aren't NULL terminated. But please check the patch, you'll immediately see what I mean. Copying the strings, and NULL terminating the copy, is not a good option because that would make the entire mmap-concept pointless (you still copy it to real memory, so the entire reason-for-mmap then gone). Note that this is what the current implementation also does: it copies the string and null terminates the copy. And then frees the malloc that was allocated for reading it from the file. In fact is that copy unnecessary. Since fread() is a copy (and not like mmap also real on-disk data), it wouldn't matter if you'd use the original malloc()'d memory. This memory copying is probably causing the memory segmentation I mentioned above. If you'd implement it like this, you'd better at least used gslice. But anyway ;) -- Philip Van Hoof, software developer at x-tend home: me at pvanhoof dot be gnome: pvanhoof at gnome dot org work: vanhoof at x-tend dot be http://www.pvanhoof.be - http://www.x-tend.be ? camel-mime-tables.c ? finalise Index: camel-folder-summary.c === RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.c,v
[Evolution-hackers] mmap() for the summary file
Hi there, I've been trying to replace the fread()/fopen() implementation in camel-folder-summary.c with an mmap() one. I know camel-file-utils.c will put duplicate strings in a hashtable and that way reduce memory usage for the summary information. Because a lot mail boxes have duplicate strings for the From and To headers. I know why and how this is implemented. And I understand that this already reduces memory usage a lot. However. On a small device with few memory resources, the kernel knows better when to allocate and when to deallocate uncachable data like this summary information. Therefore I propose to replace the implementation with mmap(). Not only I propose it, I also already tried it myself. While trying this, I came to the conclusion that it *would* be possible if the strings would have been terminated by '\0' in stead of being stored pascal-like using an encoded unsigned 32 bit integer in front of the string data. That decision makes this (using the current file format) impossible, unless the mmap'd memory (and therefore also the file on disk) is constantly rewritten (with '\0' characters) or unless the entire infrastructure that uses the summary strings is adapted to use this length information rather than using the strings directly from the mmap'ed memory *as* NULL terminated C strings (char arrays with a NULL termination). The second solution implies that all would have to be converted to GString's. I think it would reduce memory usage of Evolution with ~40mb (depending on the total amount of summary information being loaded). It would make the sorting of the header summary view a little bit slower on certain machines (mainly on machines that have very few memory resources left, so that the kernel will not put a lot of this mmap'ed data in its buffers/cache). The file format should be adapted in two ways: - Duplicate strings will need to be stored at only one location *on the disk*. So the hashtable implementation wouldn't be a memory-only but also a in-the-summary-file something. For example: A string-field can be a pointer to the first character of the string, or a pointer to another location in the file (in the mmap). - Strings will need to be '\0' terminated *in the file* so that they are directly usable from the mmap() memory block. Who are the brave souls that want to join me with this brain-damaging idea? And would a change like this (which would mean that a migration procedure each time an old folder-summary is loaded would need to run) ever get upstream? I measured (using valgrind) that most of the Evolution memory usages goes to storing a in-memory version of the summary files. I also measured that there's quite a lot memory segmentation going on (while loading the summary file) and that it (the memory for the file) consumes ~ twice as much as the on-filesystem filesize of the summary file. Loading using mmap() would be faster and wouldn't consume as much real memory (it would consume a mmap, that is true, and that memory would most likely go to the buffers/cache which the kernel manages, that is also true). Sorting might become a little bit slower (but probably not noticeable on most desktop hardware). I'm being serious, I would like to waste my time on this one. If the camel team of Evolution likes the idea (and wouldn't mind wasting some of their time on it as well). If not ... I'd rather wait for the disk-summary branch or for libspruce than to waste my time with it. Because forking camel would mean wasting huge amounts of time on maintaining a fork. I attached a patch with my current tryout. I already load the header of the summary file using mmap. That is already working. The difficult part is, however, making the strings themselves usable. Because those aren't NULL terminated. But please check the patch, you'll immediately see what I mean. Copying the strings, and NULL terminating the copy, is not a good option because that would make the entire mmap-concept pointless (you still copy it to real memory, so the entire reason-for-mmap then gone). Note that this is what the current implementation also does: it copies the string and null terminates the copy. And then frees the malloc that was allocated for reading it from the file. In fact is that copy unnecessary. Since fread() is a copy (and not like mmap also real on-disk data), it wouldn't matter if you'd use the original malloc()'d memory. This memory copying is probably causing the memory segmentation I mentioned above. If you'd implement it like this, you'd better at least used gslice. But anyway ;) -- Philip Van Hoof, software developer at x-tend home: me at pvanhoof dot be gnome: pvanhoof at gnome dot org work: vanhoof at x-tend dot be http://www.pvanhoof.be - http://www.x-tend.be ? camel-mime-tables.c ? finalise Index: camel-folder-summary.c === RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.c,v