Re: [Evolution-hackers] mmap() for the summary file

2006-06-19 Thread Florian Boor
Hi,

[EMAIL PROTECTED] wrote:
 H... I am not really convinced yet; I mean even when I have
 100 mails with the same email address in my Inbox (which seems 
 a lot from one person!). The email-address would be 20 chars; now having
 'm all in one string saves me, even in this extreme case, only
 about 2Kb, event forgetting about the overhead of the hash tables etc.

as long as we talk about individual senders this is true, but if you think about
 mailinglists and all this automated server notifications this could save quite
some memory. As long as it is possible to implement this without a too big
effort and code overhead it would be useful. Of course - even saving 100kb which
would be optimistic for a huge mailbox is not be worth to mention on a modern
PC - but on a mobile device that would be nice to have.

Greetings

Florian

-- 
The dream of yesterday  Florian Boor
is the hope of todayTel: +49 271-771091-14
and the reality of tomorrow.Fax: +49 271-771091-19
[Robert Hutchings Goddard, 1904][EMAIL PROTECTED]

1D78 2D4D 6C53 1CA4 5588  D07B A8E7 940C 25B7 9A76
___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers


Re: [Evolution-hackers] mmap() for the summary file

2006-06-19 Thread Philip Van Hoof
On Mon, 2006-06-19 at 11:56 +0200, Florian Boor wrote:

 as long as we talk about individual senders this is true, but if you think 
 about
  mailinglists and all this automated server notifications this could save 
 quite
 some memory. As long as it is possible to implement this without a too big
 effort and code overhead it would be useful. Of course - even saving 100kb 
 which
 would be optimistic for a huge mailbox is not be worth to mention on a modern
 PC - but on a mobile device that would be nice to have.

But the complexity for getting this *in* the file that is to be mmap'ed
does increase then. Also the backward compatibility is undoable this
way.

You'll need to implement some sort of hash-table with pointers to
strings at the beginning of the file, hash-keys in the summary tree
itself. And the strings at the end of the file.

Agreed that this would dramatically increase loading time (the time
needed to seek/walk-a-pointer over the entire mmap()ed file and create
summary-info structs). As it would decrease the amount of data that is
to be mmap()ed. 

I agree this is worthwhile for the current read() to memory
implementation. I don't think the complexity is going to payoff in real
improvements for a mmap() implementation of it. But I would,
nevertheless, be very interested in the experiment.

Note that there's an mmap() on Windows and that glib abstracts it with
some glib API. I would of course use the glib abstraction API for this.


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be

___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers


[Evolution-hackers] mmap() for the summary file

2006-06-11 Thread Philip Van Hoof
Hi there,

I've been trying to replace the fread()/fopen() implementation in
camel-folder-summary.c with an mmap() one.

I know camel-file-utils.c will put duplicate strings in a hashtable and
that way reduce memory usage for the summary information. Because a lot
mail boxes have duplicate strings for the From and To headers. I know
why and how this is implemented. And I understand that this already
reduces memory usage a lot.

However. On a small device with few memory resources, the kernel knows
better when to allocate and when to deallocate uncachable data like this
summary information.

Therefore I propose to replace the implementation with mmap(). Not only
I propose it, I also already tried it myself.

While trying this, I came to the conclusion that it *would* be possible
if the strings would have been terminated by '\0' in stead of being
stored pascal-like using an encoded unsigned 32 bit integer in front of
the string data.

That decision makes this (using the current file format) impossible,
unless the mmap'd memory (and therefore also the file on disk) is
constantly rewritten (with '\0' characters) or unless the entire
infrastructure that uses the summary strings is adapted to use this
length information rather than using the strings directly from the
mmap'ed memory *as* NULL terminated C strings (char arrays with a NULL
termination). The second solution implies that all would have to be
converted to GString's.

I think it would reduce memory usage of Evolution with ~40mb (depending
on the total amount of summary information being loaded). It would make
the sorting of the header summary view a little bit slower on certain
machines (mainly on machines that have very few memory resources left,
so that the kernel will not put a lot of this mmap'ed data in its
buffers/cache).

The file format should be adapted in two ways:

- Duplicate strings will need to be stored at only one location *on the
disk*. So the hashtable implementation wouldn't be a memory-only but
also a in-the-summary-file something.

For example: A string-field can be a pointer to the first character of
the string, or a pointer to another location in the file (in the mmap).

- Strings will need to be '\0' terminated *in the file* so that they are
directly usable from the mmap() memory block. 


Who are the brave souls that want to join me with this brain-damaging
idea? And would a change like this (which would mean that a migration
procedure each time an old folder-summary is loaded would need to run)
ever get upstream?

I measured (using valgrind) that most of the Evolution memory usages
goes to storing a in-memory version of the summary files. I also
measured that there's quite a lot memory segmentation going on (while
loading the summary file) and that it (the memory for the file) consumes
~ twice as much as the on-filesystem filesize of the summary file.

Loading using mmap() would be faster and wouldn't consume as much real
memory (it would consume a mmap, that is true, and that memory would
most likely go to the buffers/cache which the kernel manages, that is
also true). Sorting might become a little bit slower (but probably not
noticeable on most desktop hardware).

I'm being serious, I would like to waste my time on this one. If the
camel team of Evolution likes the idea (and wouldn't mind wasting some
of their time on it as well). If not ... I'd rather wait for the
disk-summary branch or for libspruce than to waste my time with it.
Because forking camel would mean wasting huge amounts of time on
maintaining a fork.

I attached a patch with my current tryout. I already load the header of
the summary file using mmap. That is already working. The difficult part
is, however, making the strings themselves usable. Because those aren't
NULL terminated. But please check the patch, you'll immediately see what
I mean.

Copying the strings, and NULL terminating the copy, is not a good option
because that would make the entire mmap-concept pointless (you still
copy it to real memory, so the entire reason-for-mmap then gone). Note
that this is what the current implementation also does: it copies the
string and null terminates the copy. And then frees the malloc that was
allocated for reading it from the file.

In fact is that copy unnecessary. Since fread() is a copy (and not like
mmap also real on-disk data), it wouldn't matter if you'd use the
original malloc()'d memory. This memory copying is probably causing the
memory segmentation I mentioned above. If you'd implement it like this,
you'd better at least used gslice.

But anyway ;)


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be
? camel-mime-tables.c
? finalise
Index: camel-folder-summary.c
===
RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.c,v

[Evolution-hackers] mmap() for the summary file

2006-06-11 Thread Philip Van Hoof
Hi there,

I've been trying to replace the fread()/fopen() implementation in
camel-folder-summary.c with an mmap() one.

I know camel-file-utils.c will put duplicate strings in a hashtable and
that way reduce memory usage for the summary information. Because a lot
mail boxes have duplicate strings for the From and To headers. I know
why and how this is implemented. And I understand that this already
reduces memory usage a lot.

However. On a small device with few memory resources, the kernel knows
better when to allocate and when to deallocate uncachable data like this
summary information.

Therefore I propose to replace the implementation with mmap(). Not only
I propose it, I also already tried it myself.

While trying this, I came to the conclusion that it *would* be possible
if the strings would have been terminated by '\0' in stead of being
stored pascal-like using an encoded unsigned 32 bit integer in front of
the string data.

That decision makes this (using the current file format) impossible,
unless the mmap'd memory (and therefore also the file on disk) is
constantly rewritten (with '\0' characters) or unless the entire
infrastructure that uses the summary strings is adapted to use this
length information rather than using the strings directly from the
mmap'ed memory *as* NULL terminated C strings (char arrays with a NULL
termination). The second solution implies that all would have to be
converted to GString's.

I think it would reduce memory usage of Evolution with ~40mb (depending
on the total amount of summary information being loaded). It would make
the sorting of the header summary view a little bit slower on certain
machines (mainly on machines that have very few memory resources left,
so that the kernel will not put a lot of this mmap'ed data in its
buffers/cache).

The file format should be adapted in two ways:

- Duplicate strings will need to be stored at only one location *on the
disk*. So the hashtable implementation wouldn't be a memory-only but
also a in-the-summary-file something.

For example: A string-field can be a pointer to the first character of
the string, or a pointer to another location in the file (in the mmap).

- Strings will need to be '\0' terminated *in the file* so that they are
directly usable from the mmap() memory block. 


Who are the brave souls that want to join me with this brain-damaging
idea? And would a change like this (which would mean that a migration
procedure each time an old folder-summary is loaded would need to run)
ever get upstream?

I measured (using valgrind) that most of the Evolution memory usages
goes to storing a in-memory version of the summary files. I also
measured that there's quite a lot memory segmentation going on (while
loading the summary file) and that it (the memory for the file) consumes
~ twice as much as the on-filesystem filesize of the summary file.

Loading using mmap() would be faster and wouldn't consume as much real
memory (it would consume a mmap, that is true, and that memory would
most likely go to the buffers/cache which the kernel manages, that is
also true). Sorting might become a little bit slower (but probably not
noticeable on most desktop hardware).

I'm being serious, I would like to waste my time on this one. If the
camel team of Evolution likes the idea (and wouldn't mind wasting some
of their time on it as well). If not ... I'd rather wait for the
disk-summary branch or for libspruce than to waste my time with it.
Because forking camel would mean wasting huge amounts of time on
maintaining a fork.

I attached a patch with my current tryout. I already load the header of
the summary file using mmap. That is already working. The difficult part
is, however, making the strings themselves usable. Because those aren't
NULL terminated. But please check the patch, you'll immediately see what
I mean.

Copying the strings, and NULL terminating the copy, is not a good option
because that would make the entire mmap-concept pointless (you still
copy it to real memory, so the entire reason-for-mmap then gone). Note
that this is what the current implementation also does: it copies the
string and null terminates the copy. And then frees the malloc that was
allocated for reading it from the file.

In fact is that copy unnecessary. Since fread() is a copy (and not like
mmap also real on-disk data), it wouldn't matter if you'd use the
original malloc()'d memory. This memory copying is probably causing the
memory segmentation I mentioned above. If you'd implement it like this,
you'd better at least used gslice.

But anyway ;)


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be
? camel-mime-tables.c
? finalise
Index: camel-folder-summary.c
===
RCS file: /cvs/gnome/evolution-data-server/camel/camel-folder-summary.c,v