Re: [fpc-devel] Memory consumed by strings
On Sat, 22 Nov 2008 23:05:43 +0200 listmember [EMAIL PROTECTED] wrote:

> Is there a way to determine how much memory is consumed by strings in a running application? I'd like to know this, in particular, for FPC and Lazarus -- to begin with.
>
> And the reason I'd like to know is this: whenever I suggest that char size be increased to 4 bytes, the idea gets opposed on the grounds that it will need huge memory -- 4 times as much.
>
> There's of course some merit in that argument, but I have no idea what it is '4 times' of. This is not very engineer-like -- it being unmeasured.
>
> Can anyone suggest a way to measure the memory load caused by strings?

The exact amount depends on the application, but think about loading 100 MB of text files into strings. This will need at least the 100 MB plus the overhead for each string (at least 12 bytes). With 2-byte chars an extra 100 MB would be needed, and with 4-byte chars an additional 300 MB.

For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory. If this were changed to UnicodeString (2 bytes per char), the IDE would need 50 to 200 MB more memory. And because many time-consuming tasks are already bound by the memory bandwidth of current computers, the IDE would become twice as slow. Do the math for 4 bytes per char.

Mattias

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
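The arithmetic in the reply above can be sketched as a small program. This is only an illustration: it assumes all 100 MB sit in a single string, takes the 12-byte per-string overhead from the message as a lower bound, and the names StringBytes and HeaderBytes are made up for the example.

```pascal
program StringMemEstimate;
{$mode objfpc}{$H+}

const
  MB = 1024 * 1024;
  HeaderBytes = 12;  // per-string overhead quoted in the message (assumed lower bound)

// Bytes needed to hold CharCount characters at BytesPerChar each, plus one header.
function StringBytes(CharCount: Int64; BytesPerChar: Integer): Int64;
begin
  Result := HeaderBytes + CharCount * BytesPerChar;
end;

var
  W: Integer;
begin
  // 100 MB worth of characters held at 1, 2 and 4 bytes per char
  W := 1;
  while W <= 4 do
  begin
    WriteLn(W, ' byte(s)/char: ', StringBytes(100 * MB, W) div MB, ' MB');
    W := W * 2;
  end;
end.
```

With many small strings instead of one large one, the header term would be multiplied by the number of strings, which is why the total depends so much on the application.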
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 10:19, Mattias Gaertner wrote:

> On Sat, 22 Nov 2008 23:05:43 +0200 listmember [EMAIL PROTECTED] wrote:
>
>> Is there a way to determine how much memory is consumed by strings in a running application? I'd like to know this, in particular, for FPC and Lazarus -- to begin with.
>>
>> And the reason I'd like to know is this: whenever I suggest that char size be increased to 4 bytes, the idea gets opposed on the grounds that it will need huge memory -- 4 times as much.
>>
>> There's of course some merit in that argument, but I have no idea what it is '4 times' of. This is not very engineer-like -- it being unmeasured.
>>
>> Can anyone suggest a way to measure the memory load caused by strings?
>
> The exact amount depends on the application, but think about loading 100 MB of text files into strings. This will need at least the 100 MB plus the overhead for each string (at least 12 bytes). With 2-byte chars an extra 100 MB would be needed, and with 4-byte chars an additional 300 MB.
>
> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory. If this were changed to UnicodeString (2 bytes per char), the IDE would need 50 to 200 MB more memory. And because many time-consuming tasks are already bound by the memory bandwidth of current computers, the IDE would become twice as slow. Do the math for 4 bytes per char.

What I had in mind wasn't to store the string data in UTF-32 (or UCS-4); it would still be UTF-8 or whatever. I am only considering the in-memory representation being UTF-32 (or UCS-4).

This way, loading and saving would hardly be affected, yet in-memory operations would be a lot faster and simpler.
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 10:31:39 +0200 listmember [EMAIL PROTECTED] wrote:

> [...]
> What I had in mind wasn't to store the string data in UTF-32 (or UCS-4); it would still be UTF-8 or whatever. I am only considering the in-memory representation being UTF-32 (or UCS-4).

What do you mean by 'memory representation'?

> This way, loading and saving would hardly be affected, yet in-memory operations would be a lot faster and simpler.

Mattias
Re: [fpc-devel] Memory consumed by strings
>> I am only considering the in-memory representation being UTF-32 (or UCS-4).
>
> What do you mean by 'memory representation'?

That each char in a string in memory would be 4 bytes (or more); yet, when saved to disk (or transmitted across the net, etc.), it would be UTF-8 compressed.

IOW, no compression applied to in-memory strings.
Re: [fpc-devel] Memory consumed by strings
Actually, load times do not seem to be linear at all. A file about 3.2 times larger (labelled '400 MB' and '100 MB' in the output) seems to take only a bit more than twice as long.

I did one very simple test using 2 text files:

  File 1: 384 MB (403,248,710 bytes)
  File 2: 120 MB (126,680,448 bytes)

with the code below:

  // Needs the Windows, Classes and SysUtils units in the uses clause;
  // FILE_1 and FILE_2 are constants holding the paths of the two files.
  procedure TForm1.Button1Click(Sender: TObject);
  var
    InitialValue1: Int64;  // initial performance counter value
    Divisor1: Int64;       // performance counter frequency (ticks per second)
    CurrentValue1: Int64;  // current performance counter value
    Time1: Double;
    Time2: Double;
    Stream1: TMemoryStream;
    Index1: Integer;
  begin
    Memo1.Lines.Clear;
    QueryPerformanceFrequency(Divisor1);
    Index1 := 0;
    while Index1 < 100 do
    begin
      // time loading the larger file
      QueryPerformanceCounter(InitialValue1);
      Stream1 := TMemoryStream.Create;
      Stream1.LoadFromFile(FILE_1);
      Stream1.Free;
      QueryPerformanceCounter(CurrentValue1);
      Time1 := (CurrentValue1 - InitialValue1) / Divisor1;

      // time loading the smaller file
      QueryPerformanceCounter(InitialValue1);
      Stream1 := TMemoryStream.Create;
      Stream1.LoadFromFile(FILE_2);
      Stream1.Free;
      QueryPerformanceCounter(CurrentValue1);
      Time2 := (CurrentValue1 - InitialValue1) / Divisor1;

      Memo1.Lines.Add(Format('[400 MB: %3.3ns] [100 MB: %3.3ns]', [Time1, Time2]));
      Inc(Index1);
    end;
  end;

Output:

  [400 MB: 0.514s] [100 MB: 0.241s]
  [400 MB: 0.535s] [100 MB: 0.239s]
  [400 MB: 0.532s] [100 MB: 0.252s]
  [400 MB: 0.532s] [100 MB: 0.245s]
  [400 MB: 0.541s] [100 MB: 0.240s]
  [400 MB: 0.533s] [100 MB: 0.240s]
  [400 MB: 0.540s] [100 MB: 0.240s]
  [400 MB: 0.532s] [100 MB: 0.245s]
  [400 MB: 0.532s] [100 MB: 0.234s]
  [400 MB: 0.538s] [100 MB: 0.240s]
  [400 MB: 0.531s] [100 MB: 0.241s]
  [400 MB: 0.533s] [100 MB: 0.242s]
  [400 MB: 0.531s] [100 MB: 0.242s]
  [400 MB: 0.585s] [100 MB: 0.252s]
  [400 MB: 0.531s] [100 MB: 0.243s]
  [400 MB: 0.531s] [100 MB: 0.289s]
  [400 MB: 0.569s] [100 MB: 0.240s]
  [400 MB: 0.532s] [100 MB: 0.235s]
  [400 MB: 0.535s] [100 MB: 0.241s]
  [400 MB: 0.533s] [100 MB: 0.242s]
  [400 MB: 0.532s] [100 MB: 0.239s]
  [400 MB: 0.531s] [100 MB: 0.241s]
  [400 MB: 0.532s] [100 MB: 0.239s]
  [400 MB: 0.532s] [100 MB: 0.245s]
  [400 MB: 0.536s] [100 MB: 0.239s]
  [400 MB: 0.534s] [100 MB: 0.256s]
  [400 MB: 0.547s] [100 MB: 0.242s]
  [400 MB: 0.535s] [100 MB: 0.261s]
  [400 MB: 0.530s] [100 MB: 0.232s]
  [400 MB: 0.541s] [100 MB: 0.239s]
  [400 MB: 0.533s] [100 MB: 0.243s]
  [400 MB: 0.535s] [100 MB: 0.244s]
  [400 MB: 0.530s] [100 MB: 0.231s]
  [400 MB: 0.540s] [100 MB: 0.240s]
  [400 MB: 0.582s] [100 MB: 0.330s]
  [400 MB: 0.557s] [100 MB: 0.231s]
  [400 MB: 0.539s] [100 MB: 0.240s]
  [400 MB: 0.531s] [100 MB: 0.230s]
  [400 MB: 0.539s] [100 MB: 0.243s]
  [400 MB: 0.531s] [100 MB: 0.246s]
  [400 MB: 0.535s] [100 MB: 0.240s]
  [400 MB: 0.532s] [100 MB: 0.279s]
  [400 MB: 0.609s] [100 MB: 0.241s]
  [400 MB: 0.533s] [100 MB: 0.249s]
  [400 MB: 0.537s] [100 MB: 0.239s]
  [400 MB: 0.531s] [100 MB: 0.242s]
  [400 MB: 0.530s] [100 MB: 0.240s]
  [400 MB: 0.535s] [100 MB: 0.238s]
  [400 MB: 0.532s] [100 MB: 0.241s]
  [400 MB: 0.536s] [100 MB: 0.242s]
  [400 MB: 0.532s] [100 MB: 0.240s]
  [400 MB: 0.534s] [100 MB: 0.230s]
  [400 MB: 0.545s] [100 MB: 0.235s]
  [400 MB: 0.538s] [100 MB: 0.240s]
  [400 MB: 0.531s] [100 MB: 0.235s]
  [400 MB: 0.536s] [100 MB: 0.229s]
  [400 MB: 0.540s] [100 MB: 0.232s]
  [400 MB: 0.540s] [100 MB: 0.243s]
  [400 MB: 0.539s] [100 MB: 0.234s]
  [400 MB: 0.540s] [100 MB: 0.230s]
  [400 MB: 0.539s] [100 MB: 0.261s]
  [400 MB: 0.535s] [100 MB: 0.242s]
  [400 MB: 0.529s] [100 MB: 0.234s]
  [400 MB: 0.538s] [100 MB: 0.234s]
  [400 MB: 0.538s] [100 MB: 0.244s]
  [400 MB: 0.535s] [100 MB: 0.242s]
  [400 MB: 0.529s] [100 MB: 0.239s]
  [400 MB: 0.532s] [100 MB: 0.251s]
  [400 MB: 0.631s] [100 MB: 0.236s]
  [400 MB: 0.535s] [100 MB: 0.242s]
  [400 MB: 0.531s] [100 MB: 0.243s]
  [400 MB: 0.531s] [100 MB: 0.239s]
  [400 MB: 0.531s] [100 MB: 0.232s]
  [400 MB: 0.543s] [100 MB: 0.239s]
  [400 MB: 0.528s] [100 MB: 0.232s]
  [400 MB: 0.538s] [100 MB: 0.242s]
  [400 MB: 0.537s] [100 MB: 0.233s]
  [400 MB: 0.537s] [100 MB: 0.241s]
  [400 MB: 0.533s] [100 MB: 0.230s]
  [400 MB: 0.543s] [100 MB: 0.242s]
  [400 MB: 0.533s] [100 MB: 0.240s]
  [400 MB: 0.531s] [100 MB: 0.253s]
  [400 MB: 0.537s] [100 MB: 0.243s]
  [400 MB: 0.547s] [100 MB: 0.238s]
  [400 MB: 0.539s] [100 MB: 0.233s]
  [400 MB: 0.545s] [100 MB: 0.257s]
  [400 MB: 0.572s] [100 MB: 0.318s]
  [400 MB: 0.563s] [100 MB: 0.238s]
  [400 MB: 0.536s] [100 MB: 0.241s]
  [400 MB: 0.533s] [100 MB: 0.249s]
  [400 MB: 0.531s] [100 MB: 0.242s]
  [400 MB: 0.534s] [100 MB: 0.241s]
  [400 MB: 0.532s] [100 MB: 0.238s]
  [400 MB: 0.537s] [100 MB: 0.241s]
  [400 MB: 0.616s] [100 MB: 0.253s]
  [400 MB: 0.536s] [100 MB: 0.228s]
  [400 MB: 0.540s] [100 MB: 0.244s]
  [400 MB: 0.539s] [100 MB: 0.237s]
  [400 MB: 0.536s] [100 MB: 0.241s]
  [400 MB: 0.539s] [100 MB: 0.236s]
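To put numbers on the comparison, here is a minimal sketch that computes the size ratio and the time ratio from the two file sizes and the first timing sample reported in the message. The constant names are chosen for this example only; the throughput lines simply divide bytes by seconds and say nothing about where the time is spent.

```pascal
program LoadRatio;
{$mode objfpc}{$H+}

const
  Size1 = 403248710;  // bytes, File 1 as reported in the message
  Size2 = 126680448;  // bytes, File 2 as reported in the message
  Time1 = 0.514;      // seconds, first "400 MB" sample
  Time2 = 0.241;      // seconds, first "100 MB" sample

begin
  WriteLn('size ratio: ', (Size1 / Size2):0:2);  // 3.18
  WriteLn('time ratio: ', (Time1 / Time2):0:2);  // 2.13
  // apparent throughput in units of 10^6 bytes per second
  WriteLn('file 1 throughput: ', (Size1 / 1e6 / Time1):0:0);
  WriteLn('file 2 throughput: ', (Size2 / 1e6 / Time2):0:0);
end.
```

So the larger file is about 3.2 times the size of the smaller one, not 4 times, and takes a little over twice as long in this sample.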
Re: [fpc-devel] Memory consumed by strings
Graeme Geldenhuys wrote:

> On Sun, Nov 23, 2008 at 10:19 AM, Mattias Gaertner [EMAIL PROTECTED] wrote:
>
>> On Sat, 22 Nov 2008 23:05:43 +0200
>> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory. If this were changed to UnicodeString (2 bytes per char), the IDE would need 50 to 200 MB more memory.
>
> Ah, and that would probably explain why Martin decided not to pre-parse units in MSEide - for things like code completion etc... MSEide's memory usage would balloon greatly, compared to Lazarus.

One can always choose the string type which is most appropriate for the given task. For storing Pascal (or whatever) sources, one choice is not to use plain text at all, but to replace each identifier with its index in a dictionary. It depends on the task.

Regards,
Sergei
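The dictionary idea from the message above might look roughly like the sketch below. It is an illustration under assumptions, not how any IDE mentioned in this thread stores sources: the Intern function and the Dict table are hypothetical, and a plain TStringList with a linear IndexOf is used only to keep the example short (a hash table would be the realistic choice).

```pascal
program InternIdents;
{$mode objfpc}{$H+}

uses
  Classes;

var
  Dict: TStringList;  // hypothetical identifier table; list index = identifier ID

// Return the ID of Ident, adding it to the table the first time it is seen.
// TStringList.IndexOf is case-insensitive by default, which suits Pascal identifiers.
function Intern(const Ident: string): Integer;
begin
  Result := Dict.IndexOf(Ident);
  if Result < 0 then
    Result := Dict.Add(Ident);
end;

begin
  Dict := TStringList.Create;
  try
    WriteLn(Intern('Result'));  // 0 (new entry)
    WriteLn(Intern('Length'));  // 1 (new entry)
    WriteLn(Intern('result'));  // 0 (same identifier, different case)
  finally
    Dict.Free;
  end;
end.
```

Each repeated identifier then costs one integer in the stored source instead of a copy of its characters.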
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 11:09:25 +0200 listmember [EMAIL PROTECTED] wrote:

>>> I am only considering the in-memory representation being UTF-32 (or UCS-4).
>>
>> What do you mean by 'memory representation'?
>
> That each char in a string in memory would be 4 bytes (or more); yet, when saved to disk (or transmitted across the net, etc.), it would be UTF-8 compressed.
>
> IOW, no compression applied to in-memory strings.

I thought my example described just that. If strings use 4 bytes per char then ASCII text will need 4 times more memory.

Mattias
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 13:07, Graeme Geldenhuys wrote:

> On Sun, Nov 23, 2008 at 12:29 PM, listmember [EMAIL PROTECTED] wrote:
>
>> What I am curious about is: 4 times of what?
>
> RAM, Random Access Memory, DIMMs: those little green sticks you shove into the motherboard. :-)

:)
Re: [fpc-devel] Memory consumed by strings
On Sun, Nov 23, 2008 at 1:05 PM, listmember [EMAIL PROTECTED] wrote:

> I just checked (using Process Explorer, under Windows) and this is what I see:
>
>   Working set: 2,216 K
>   Peak Working set: 26,988 K
>
> I can't see where that 50 MB fits into that.

Well, it all depends on how many files you have open, project size, etc.

Regards,
- Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] Memory consumed by strings
listmember wrote:

> This is my thick-day. So, permit me to ask this:
>
> Are you really saying that strings occupy 50 MB of Lazarus's memory footprint?
>
> I just checked (using Process Explorer, under Windows) and this is what I see:
>
>   Working set: 2,216 K
>   Peak Working set: 26,988 K
>
> I can't see where that 50 MB fits into that.

There's no easy way to tell how much storage the strings occupy. There are functions like GetHeapStatus and GetFPCHeapStatus, but they return the total amount of memory occupied by everything that the application allocates: objects, dynamic arrays, strings, etc.

However, you may hack into the RTL at the NewAnsiString / NewWideString / NewUnicodeString procedures and install hooks that record the number of bytes requested. That shouldn't be too difficult to do.

Regards,
Sergei
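As a rough illustration of the first point, GetFPCHeapStatus can be sampled before and after an allocation. This sketch assumes the default FPC heap manager is in use; the delta it prints covers whatever was allocated in between (here a single string), so it does not separate strings from other allocations in a real application.

```pascal
program HeapDelta;
{$mode objfpc}{$H+}

var
  Before, After: TFPCHeapStatus;
  S: AnsiString;
begin
  Before := GetFPCHeapStatus;
  SetLength(S, 1000000);  // allocate a 1,000,000-char AnsiString
  After := GetFPCHeapStatus;

  WriteLn('Heap used before: ', Before.CurrHeapUsed);
  WriteLn('Heap used after:  ', After.CurrHeapUsed);
  // The difference includes the string header and any allocator rounding.
  WriteLn('Delta (bytes):    ', After.CurrHeapUsed - Before.CurrHeapUsed);
end.
```

The delta should come out somewhat above 1,000,000 bytes, which is one way to see the per-string overhead mentioned earlier in the thread.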
Re: [fpc-devel] Memory consumed by strings
On Sun, Nov 23, 2008 at 12:29 PM, listmember [EMAIL PROTECTED] wrote:

> What I am curious about is: 4 times of what?

RAM, Random Access Memory, DIMMs: those little green sticks you shove into the motherboard. :-)

Regards,
- Graeme -
Re: [fpc-devel] Memory consumed by strings
On Sun, Nov 23, 2008 at 1:13 PM, Graeme Geldenhuys [EMAIL PROTECTED] wrote:

>> I can't see where that 50 MB fits into that.
>
> Well, it all depends on how many files you have open, project size, etc.

As an example: using a small project, Lazarus sits at 26 MB of memory. I then opened the MacOSAll.pas unit (a 10.2 MB text file) from FPC, and Lazarus's memory usage jumped to 80 MB. So as you can see, it varies depending on what you have open, etc.

Regards,
- Graeme -
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 13:05:15 +0200 listmember [EMAIL PROTECTED] wrote:

> On 2008-11-23 12:50, Jonas Maebe wrote:
>
>> On 23 Nov 2008, at 11:29, listmember wrote:
>>
>>> It is not hard to tell that an app that works with text files (such as Lazarus) will consume 4 times more memory per file loaded.
>>>
>>> But how much memory does, say, Lazarus itself consume specifically for string storage when run for the first time?
>>
>> From Mattias' original answer:
>>
>>> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory.
>>
>> I.e., at least 4 times 50 to 200 MB.
>
> This is my thick-day. So, permit me to ask this:
>
> Are you really saying that strings occupy 50 MB of Lazarus's memory footprint?
>
> I just checked (using Process Explorer, under Windows) and this is what I see:
>
>   Working set: 2,216 K
>   Peak Working set: 26,988 K
>
> I can't see where that 50 MB fits into that.

Do a 'find declaration' on an identifier that does not exist. This will explore all units of the uses section.

Mattias
Re: [fpc-devel] Memory consumed by strings
> However, you may hack into the RTL at the NewAnsiString / NewWideString / NewUnicodeString procedures and install hooks that record the number of bytes requested. That shouldn't be too difficult to do.

This is what I was looking for. Thank you.
Re: [fpc-devel] Memory consumed by strings
> Do a 'find declaration' on an identifier that does not exist. This will explore all units of the uses section.

Now I see what you mean.

But isn't this a design choice: caching all sources in memory for speed reasons, as opposed to opening and closing each file on demand?

Still, if that is how it works, it is how it works.
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008, listmember wrote:

> What I had in mind wasn't to store the string data in UTF-32 (or UCS-4); it would still be UTF-8 or whatever. I am only considering the in-memory representation being UTF-32 (or UCS-4).
>
> This way, loading and saving would hardly be affected, yet in-memory operations would be a lot faster and simpler.

For source code, an extended-ASCII charset like UTF-8 is the best choice: since all characters that need processing are in the ASCII range, the code needs to do nothing about the high ASCII codes except keep them in one piece. Therefore, any other encoding is a waste of memory and does not gain you any speed. For that reason, I don't see the compiler switching from 8-bit processing either.

The situation is very different when processing real text: the memory-saving advantages disappear for the majority of the world, and if you want to process characters beyond #127, UTF-16 and UTF-32 are much easier. Obviously, UTF-32 is the best encoding if the characters you need to process are beyond #65535.

Only if you need to process characters (rather than pass them on), UTF-32 is a lot faster and simpler.

Daniël
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 13:49, Jonas Maebe wrote:

> On 23 Nov 2008, at 12:35, listmember wrote:
>
>> But isn't this a design choice: caching all sources in memory for speed reasons, as opposed to opening and closing each file on demand?
>
> For very large projects, that should probably be done anyway at some point. But even in that case, using a more memory-efficient string type enables you to keep more data in memory and hence potentially obtain better performance.

The last time I joined a relevant discussion, I was told that worrying about a native UCS-4 string type would be pointless, simply because that sort of thing is really needed for word processors only.

Now, I have been informed that Lazarus (and perhaps other IDEs) use upwards of 50 MB of string space just to do one of their basic operations.

That leaves me wondering how much performance we lose in endlessly decompressing UTF-8 data, instead of using, say, UCS-4 strings.
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 14:10, Daniël Mantione wrote:

> Therefore, any other encoding is a waste of memory and does not gain you any speed. For that reason, I don't see the compiler switching from 8-bit processing either.

I nearly fully agree with you.

Except when a string constant needs to contain non-ASCII chars. What do we do in these cases?

> Only if you need to process characters (rather than pass them on), UTF-32 is a lot faster and simpler.

Yes. If I knew how to write this patch, I'd be working on it right now.
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008, listmember wrote:

> On 2008-11-23 14:10, Daniël Mantione wrote:
>
>> Therefore, any other encoding is a waste of memory and does not gain you any speed. For that reason, I don't see the compiler switching from 8-bit processing either.
>
> I nearly fully agree with you.
>
> Except when a string constant needs to contain non-ASCII chars. What do we do in these cases?

The common approach is to do nothing; no processing needs to be done. I.e. the compiler just passes on the bytes one by one from the source file to the object file.

For an IDE, this is a little bit more complicated. E.g. searching for a ç in a source file needs to find both the composed and the decomposed variant, and in the case of UTF-8, this character can be encoded in 1, 2, 3 or 4 bytes, which all need to be found. This is where UTF-16 and UTF-32 start to make sense.

>> Only if you need to process characters (rather than pass them on), UTF-32 is a lot faster and simpler.
>
> Yes. If I knew how to write this patch, I'd be working on it right now.

Unfortunately a UTF-32 string type is not on our roadmap either, so it would have to be a user contribution.

Daniël
Re: [fpc-devel] Memory consumed by strings
On 23 Nov 2008, at 13:31, Daniël Mantione wrote:

> For an IDE, this is a little bit more complicated. E.g. searching for a ç in a source file needs to find both the composed and the decomposed variant, and in the case of UTF-8, this character can be encoded in 1, 2, 3 or 4 bytes, which all need to be found. This is where UTF-16 and UTF-32 start to make sense.

Characters can also be decomposed in UTF-16 and in UTF-32 (for the same reasons as in UTF-8).

Jonas
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 12:37:32 +0100 Martin Schreiber [EMAIL PROTECTED] wrote:

> On Sunday 23 November 2008 09.26:35 Graeme Geldenhuys wrote:
>
>> On Sun, Nov 23, 2008 at 10:19 AM, Mattias Gaertner [EMAIL PROTECTED] wrote:
>>
>>> On Sat, 22 Nov 2008 23:05:43 +0200
>>> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory. If this were changed to UnicodeString (2 bytes per char), the IDE would need 50 to 200 MB more memory.
>>
>> Ah, and that would probably explain why Martin decided not to pre-parse units in MSEide - for things like code completion etc... MSEide's memory usage would balloon greatly, compared to Lazarus.
>
> MSEide parses the code for code navigation only, and on demand. For creating event handlers and the like, the compiled-in RTTI is used. I decided not to parse the RTL because I wanted to be independent of the source installation, and because I think exact parsing of the whole FPC RTL and other libraries is too difficult and not necessary, since RTTI provides sufficient information. The parser uses 8-bit strings; 16-bit strings are used in the code editor. It is possible to work a whole day with MSEide without closing a single file and without noticeable loss of speed.

MSEGui is fast and makes sophisticated use of the RTTI. I think, too, that the internal format of the (visual) source editor does not matter much.

But RTTI only contains published classes, does it not? Does MSEGui read ppu files?

Mattias
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008, Jonas Maebe wrote:

> On 23 Nov 2008, at 13:31, Daniël Mantione wrote:
>
>> For an IDE, this is a little bit more complicated. E.g. searching for a ç in a source file needs to find both the composed and the decomposed variant, and in the case of UTF-8, this character can be encoded in 1, 2, 3 or 4 bytes, which all need to be found. This is where UTF-16 and UTF-32 start to make sense.
>
> Characters can also be decomposed in UTF-16 and in UTF-32 (for the same reasons as in UTF-8).

I am aware of that, but the combining cedilla is not in the easy-to-process range of UTF-8. In other words, you cannot do

  if char[i]=combining_cedille

in UTF-8. Instead, in UTF-8 you need to make sure the string has enough characters left, and then compare multiple characters. Heck, you even need to take care of the fact that the combining cedilla can be encoded in 2, 3 or 4 bytes.

Daniël
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 14:11:50 +0200 listmember [EMAIL PROTECTED] wrote:

> [...]
>> For very large projects, that should probably be done anyway at some point. But even in that case, using a more memory-efficient string type enables you to keep more data in memory and hence potentially obtain better performance.
>
> The last time I joined a relevant discussion, I was told that worrying about a native UCS-4 string type would be pointless, simply because that sort of thing is really needed for word processors only.
>
> Now, I have been informed that Lazarus (and perhaps other IDEs) use upwards of 50 MB of string space just to do one of their basic operations.
>
> That leaves me wondering how much performance we lose in endlessly decompressing UTF-8 data, instead of using, say, UCS-4 strings.

I'm wondering what you mean by 'endlessly decompressing UTF-8 data'.

You have to make a compromise between memory, ease of use and compatibility. There is no solution without drawbacks.

If you want to process large 8-bit text files, then UTF-8 is better.
If you want to paint glyphs, then normalized UTF-32 is better.
If you want some Unicode with some memory overhead, some ease of use, and compiler support for some compatibility, then UTF-16 is better.

Mattias
Re: [fpc-devel] Memory consumed by strings
On Sunday 23 November 2008 09.26:35 Graeme Geldenhuys wrote:

> On Sun, Nov 23, 2008 at 10:19 AM, Mattias Gaertner [EMAIL PROTECTED] wrote:
>
>> On Sat, 22 Nov 2008 23:05:43 +0200
>> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory. If this were changed to UnicodeString (2 bytes per char), the IDE would need 50 to 200 MB more memory.
>
> Ah, and that would probably explain why Martin decided not to pre-parse units in MSEide - for things like code completion etc... MSEide's memory usage would balloon greatly, compared to Lazarus.

MSEide parses the code for code navigation only, and on demand. For creating event handlers and the like, the compiled-in RTTI is used. I decided not to parse the RTL because I wanted to be independent of the source installation, and because I think exact parsing of the whole FPC RTL and other libraries is too difficult and not necessary, since RTTI provides sufficient information. The parser uses 8-bit strings; 16-bit strings are used in the code editor. It is possible to work a whole day with MSEide without closing a single file and without noticeable loss of speed.

Martin
Re: [fpc-devel] Memory consumed by strings
> I thought my example described just that. If strings use 4 bytes per char then ASCII text will need 4 times more memory.

I am not disputing that.

What I am curious about is: 4 times of what?

It is not hard to tell that an app that works with text files (such as Lazarus) will consume 4 times more memory per file loaded.

But how much memory does, say, Lazarus itself consume specifically for string storage when run for the first time?

This is what I am after.
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 12:50, Jonas Maebe wrote:

> On 23 Nov 2008, at 11:29, listmember wrote:
>
>> It is not hard to tell that an app that works with text files (such as Lazarus) will consume 4 times more memory per file loaded.
>>
>> But how much memory does, say, Lazarus itself consume specifically for string storage when run for the first time?
>
> From Mattias' original answer:
>
>> For example, the Lazarus IDE typically holds 50 to 200 MB of sources in memory.
>
> I.e., at least 4 times 50 to 200 MB.

This is my thick-day. So, permit me to ask this:

Are you really saying that strings occupy 50 MB of Lazarus's memory footprint?

I just checked (using Process Explorer, under Windows) and this is what I see:

  Working set: 2,216 K
  Peak Working set: 26,988 K

I can't see where that 50 MB fits into that.
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 13:35:07 +0200 listmember [EMAIL PROTECTED] wrote:

>> Do a 'find declaration' on an identifier that does not exist. This will explore all units of the uses section.
>
> Now I see what you mean.
>
> But isn't this a design choice: caching all sources in memory for speed reasons, as opposed to opening and closing each file on demand?

The codetools do almost everything on demand. Needed sources are parsed, put together (include files) and cleaned of dead code (IFDEFs); trees are built and find-declaration results are cached. This costs a lot of time (more than the compiler doing the same). But if something changes, the codetools know what to rebuild, while FPC, OTOH, has to rebuild everything. The naked source itself normally takes less than 25% of the memory.

For example: you can change the declaration of 'integer'. FPC would now need to recompile almost every unit, but you will hardly notice much work by the codetools.

These dependencies are complex and require exclusive access. The memory belongs to the program; the source files can be changed by anyone. Therefore the files are kept in memory and automatically reloaded if they change on disk.

> Still, if that is how it works, it is how it works.

Many applications use strings for text files. As soon as they don't fit into the CPU cache you get a performance decrease.

Mattias
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008 13:49:32 +0100 (CET) Daniël Mantione [EMAIL PROTECTED] wrote:

> On Sun, 23 Nov 2008, Jonas Maebe wrote:
>
>> On 23 Nov 2008, at 13:31, Daniël Mantione wrote:
>>
>>> For an IDE, this is a little bit more complicated. E.g. searching for a ç in a source file needs to find both the composed and the decomposed variant, and in the case of UTF-8, this character can be encoded in 1, 2, 3 or 4 bytes, which all need to be found. This is where UTF-16 and UTF-32 start to make sense.
>>
>> Characters can also be decomposed in UTF-16 and in UTF-32 (for the same reasons as in UTF-8).
>
> I am aware of that, but the combining cedilla is not in the easy-to-process range of UTF-8. In other words, you cannot do
>
>   if char[i]=combining_cedille
>
> in UTF-8. Instead, in UTF-8 you need to make sure the string has enough characters left, and then compare multiple characters. Heck, you even need to take care of the fact that the combining cedilla can be encoded in 2, 3 or 4 bytes.

Which means that there are three different Unicode codes for this character, which means a single if-equal does not work in UTF-16 or UTF-32 either.

  if UTF8CharacterToUnicode(@s[i],CharLen) in [cedille1,cedille2,cedille3] then

Mattias
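The decode-then-compare approach discussed above can be sketched without any IDE library. The DecodeUtf8 function below is a hypothetical stand-in written for this example (it is not the UTF8CharacterToUnicode routine named in the message), it assumes otherwise well-formed input, and it compares against the single code point U+0327 rather than a set of variants.

```pascal
program DecodeCompare;
{$mode objfpc}{$H+}

const
  CombiningCedilla = $0327;

// Decode the UTF-8 sequence starting at byte index I of S.
// Returns the code point and sets Len to the number of bytes consumed.
// A sequence cut off by the end of the string yields U+FFFD with Len = 1.
function DecodeUtf8(const S: AnsiString; I: Integer; out Len: Integer): Cardinal;
var
  B: Byte;
  K: Integer;
begin
  B := Ord(S[I]);
  if B < $80 then begin Len := 1; Result := B; end
  else if (B and $E0) = $C0 then begin Len := 2; Result := B and $1F; end
  else if (B and $F0) = $E0 then begin Len := 3; Result := B and $0F; end
  else begin Len := 4; Result := B and $07; end;

  // make sure the string has enough bytes left before reading them
  if I + Len - 1 > Length(S) then
  begin
    Len := 1;
    Result := $FFFD;
    Exit;
  end;

  for K := 1 to Len - 1 do
    Result := (Result shl 6) or Cardinal(Ord(S[I + K]) and $3F);
end;

var
  S: AnsiString;
  I, Len: Integer;
begin
  S := 'c'#$CC#$A7;  // 'c' followed by the combining cedilla (bytes CC A7)
  I := 1;
  while I <= Length(S) do
  begin
    if DecodeUtf8(S, I, Len) = CombiningCedilla then
      WriteLn('combining cedilla at byte ', I);  // prints: combining cedilla at byte 2
    Inc(I, Len);
  end;
end.
```

The comparison itself is a single equality on the decoded value; the extra work in UTF-8 is the length check and the byte-by-byte decode that precede it.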
Re: [fpc-devel] Memory consumed by strings
Daniël Mantione wrote:

> Instead, in UTF-8 you need to make sure the string has enough characters left, and then compare multiple characters. Heck, you even need to take care of the fact that the combining cedilla can be encoded in 2, 3 or 4 bytes.

In this example it may be more efficient to encode the three variants of the cedilla into UTF-8 and do three searches with Pos(), instead of decoding the whole target string. It depends, of course, at least on how long the target string is.

Regards,
Sergei
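A minimal sketch of the Pos() approach follows. The thread does not spell out which three variants are meant, so this example assumes just two spellings of "ç" (the precomposed U+00E7 and 'c' plus combining U+0327); the FirstCedilla function and the constant names are invented for the illustration, and a third variant would be handled by one more Pos() call.

```pascal
program FindCedilla;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences for two spellings of the same character
  Precomposed: AnsiString = #$C3#$A7;     // U+00E7
  Decomposed:  AnsiString = 'c'#$CC#$A7;  // 'c' + combining cedilla U+0327

// Byte position of the first occurrence of either spelling in S, or 0 if none.
function FirstCedilla(const S: AnsiString): Integer;
var
  P1, P2: Integer;
begin
  P1 := Pos(Precomposed, S);
  P2 := Pos(Decomposed, S);
  if (P1 = 0) or ((P2 <> 0) and (P2 < P1)) then
    Result := P2
  else
    Result := P1;
end;

begin
  WriteLn(FirstCedilla('gar'#$C3#$A7'on'));    // precomposed form, prints 4
  WriteLn(FirstCedilla('garc'#$CC#$A7'on'));   // decomposed form, prints 4
  WriteLn(FirstCedilla('garcon'));             // no cedilla, prints 0
end.
```

Each Pos() call is a plain byte search, so nothing in the target string has to be decoded; the cost is one pass per variant.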
Re: [fpc-devel] Memory consumed by strings
On 23 Nov 2008, at 12:35, listmember wrote:

>> Do a 'find declaration' on an identifier that does not exist. This will explore all units of the uses section.
>
> Now I see what you mean.
>
> But isn't this a design choice: caching all sources in memory for speed reasons, as opposed to opening and closing each file on demand?

For very large projects, that should probably be done anyway at some point. But even in that case, using a more memory-efficient string type enables you to keep more data in memory and hence potentially obtain better performance.

Jonas
Re: [fpc-devel] Memory consumed by strings
In our previous episode, listmember said:

> Is there a way to determine how much memory is consumed by strings in a running application?

Maybe you can keep a counter in the routines of astrings. Increase/adjust it on newansistring or setlength.

> I'd like to know this, in particular, for FPC and Lazarus -- to begin with.
>
> And the reason I'd like to know is this: whenever I suggest that char size be increased to 4 bytes, the idea gets opposed on the grounds that it will need huge memory -- 4 times as much.

That's not the only reason:

- more memory also means slower copying.
- most OSes seem to use UTF-8 and UTF-16; with UTF-32 you would be an island, and the average text editor might not be able to read what you write.

> There's of course some merit in that argument, but I have no idea what it is '4 times' of. This is not very engineer-like -- it being unmeasured.

It is highly dependent on use. A measurement of a single application says nothing.

The app that I work on for a living has maybe 0.5 MB of strings, and hardly any time-consuming processing (mostly a simple logfile).

In previous jobs, however, I have done in-memory databases, database pumps and importers, and there it matters.

> Can anyone suggest a way to measure the memory load caused by strings?
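The counter idea from the first reply above can be tried at application level before touching the RTL. The sketch below is an assumption-laden illustration: TrackedSetLength and TrackedBytes are hypothetical names, the wrapper only sees strings resized through it, and it counts payload characters only (no headers, no strings created by concatenation or assignment).

```pascal
program StringCounter;
{$mode objfpc}{$H+}

var
  TrackedBytes: Int64 = 0;  // hypothetical running total of string payload bytes

// Hypothetical wrapper: resize S and adjust the counter by the change in length.
procedure TrackedSetLength(var S: AnsiString; NewLen: SizeInt);
begin
  Inc(TrackedBytes, NewLen - Length(S));
  SetLength(S, NewLen);
end;

var
  A, B: AnsiString;
begin
  TrackedSetLength(A, 1000);  // counter: 1000
  TrackedSetLength(B, 500);   // counter: 1500
  TrackedSetLength(A, 200);   // counter:  700 (A shrank by 800)
  WriteLn('Tracked string payload bytes: ', TrackedBytes);  // 700
end.
```

Doing the same inside the RTL string routines, as suggested in the thread, would also catch the strings this wrapper misses.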
Re: [fpc-devel] Memory consumed by strings
In our previous episode, listmember said:

> The last time I joined a relevant discussion, I was told worrying about a native UCS-4 string type would be pointless simply because that sort of thing is really needed for word processors only.
>
> Now, I have been informed that Lazarus (and perhaps other IDEs) use upwards of 50 MB of string space just to do one of their basic operations.
>
> That leaves me wondering how much we lose performance-wise in endlessly decompressing UTF-8 data, instead of using, say, UCS-4 strings.

If you leave character composition aside, you don't need to decompress at all for often-used primitives such as comparing an identifier.
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 14:34, Mattias Gaertner wrote:

> On Sun, 23 Nov 2008 14:11:50 +0200 listmember [EMAIL PROTECTED] wrote:
>> That leaves me wondering how much we lose performance-wise in endlessly decompressing UTF-8 data, instead of using, say, UCS-4 strings.
>
> I'm wondering what you mean with 'endlessly decompressing UTF-8 data'.

I am referring to going to the nth character in a string. With UTF-8 it is no longer simple arithmetic and an index operation. You have to start from zero and iterate until you get to your character --at every step, calculating whether it is 2, 3 or 4 bytes long. Doing this is decompression.

> You have to make a compromise between memory, ease of use and compatibility. There is no solution without drawbacks. If you want to process large 8bit text files then UTF-8 is better. If you want to paint glyphs then normalized UTF-32 is better. If you want some unicode with some mem overhead and some easy usage and have compiler support for some compatibility then UTF-16 is better.

Do we have to think in terms of encodings (which are ways of compressing text) when what we actually mean is 1-byte, 2-byte and 4-byte per char strings?
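The indexing difference described in this message can be sketched as follows. This is an illustration in Python, not FPC RTL code: the function names are hypothetical, and well-formed UTF-8 and little-endian UTF-32 input are assumed.

```python
import struct

def utf8_offset_of(data, n):
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8.

    Scans from the start, counting every byte that is not a
    continuation byte (continuation bytes have the form 10xxxxxx).
    """
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # ASCII or lead byte: a new code point
            if seen == n:
                return offset
            seen += 1
    raise IndexError("string has fewer than n + 1 code points")

def utf32_code_point(data, n):
    """n-th code point of little-endian UTF-32: plain index arithmetic."""
    return struct.unpack_from("<I", data, 4 * n)[0]

text = "a\u00e7\u20acz"              # 1-, 2-, 3- and 1-byte characters in UTF-8
print(utf8_offset_of(text.encode("utf-8"), 3))
print(hex(utf32_code_point(text.encode("utf-32-le"), 2)))
```

The UTF-8 lookup costs time proportional to the position, while the UTF-32 lookup is a single multiplication and read. Neither addresses combining sequences: both index code points, not user-visible characters.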
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 14:19, Mattias Gaertner wrote:

> On Sun, 23 Nov 2008 13:35:07 +0200 listmember [EMAIL PROTECTED] wrote:
> [...]
> These dependencies are complex and require exclusive access. The memory belongs to the program, the source files can be changed by anyone. Therefore the files are kept in memory and auto reloaded if they change on disk.

Makes sense. Thank you for explaining it.
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 14:49, Daniël Mantione wrote:

> On Sun, 23 Nov 2008, Jonas Maebe wrote:
>> On 23 Nov 2008, at 13:31, Daniël Mantione wrote:
>>> For an IDE, this is a little bit more complicated. I.e. searching for a ç in a source file needs to find both the composed and the decomposed variant, and in the case of UTF-8, this character can be encoded in 1, 2, 3 or 4 bytes which all need to be found. This is where UTF-16 and UTF-32 start to make sense.
>>
>> Characters can also be decomposed in UTF-16 and in UTF-32 (for the same reasons as in UTF-8).
>
> I am aware of that, but the combining cedilla is not in the easy to process range of UTF-8. In other words, you cannot do if char[i]=combining_cedille in UTF-8. Instead, in UTF-8 you need to make sure the string has enough characters left, and then compare multiple characters. Heck, you even need to take care of the fact that the combining cedilla can be encoded in 2, 3 or 4 bytes.

This is one of the million and one small details that one has to keep in mind while programming.

What I think would be more sensible is that, instead of using all these variable sizes and all, you simply use 4-byte/char strings and compose (in the UTF sense) everything into that string. You do this once, when importing/loading text into your app. And, from then on, everything is just like the good old string --except that it is a 4-byte per char string, instead of 1-byte.

Now, my question is this: How would I create a 'FourByteString' type, reference counted etc. just like the usual 'String'? How hard is it? Can someone like me, who does not speak assembler, do it? If so, where do I begin copy-pasting from 'string'?
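The "compose once when loading, then search" idea from this message can be sketched with Unicode normalization. This is an illustration in Python using the standard `unicodedata` module, not a proposal for the FPC RTL; `find_normalized` is a hypothetical helper name.

```python
import unicodedata

def find_normalized(haystack, needle):
    """Find needle in haystack after normalizing both to NFC.

    Returns an index into the normalized haystack, or -1 if absent.
    """
    def nfc(s):
        return unicodedata.normalize("NFC", s)
    return nfc(haystack).find(nfc(needle))

composed = "fa\u00e7ade"        # U+00E7, precomposed c with cedilla
decomposed = "fac\u0327ade"     # 'c' followed by U+0327, combining cedilla

# The two spellings differ as stored, but match once both are composed.
print(composed == decomposed)
print(find_normalized(decomposed, "\u00e7"))
print(find_normalized(composed, "c\u0327"))

# In well-formed UTF-8 the combining cedilla itself is the two bytes CC A7.
print("\u0327".encode("utf-8"))
```

Normalizing at load time means the search itself stays a plain substring search, whatever character width the in-memory string uses. Not every combining sequence has a precomposed form, so this covers cases like ç but is not a general answer.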
Re: [fpc-devel] Memory consumed by strings
On Sunday 23 November 2008 13.44:02 Mattias Gaertner wrote:

> But RTTI only contains published classes, does it not?

AFAIK there are some more elements where it is possible to get a typeinfo pointer. A compiler specialist can say more. :-)

> Does MSEGui read ppu files?

No.

Martin
Re: [fpc-devel] Memory consumed by strings
In our previous episode, Martin Schreiber said:

> On Sunday 23 November 2008 13.44:02 Mattias Gaertner wrote:
>> But RTTI only contains published classes, does it not?
>
> AFAIK there are some more elements where it is possible to get a typeinfo pointer. A compiler specialist can say more. :-)

Well, I'm not an expert, but I can only think of enumerations. These have RTTI under Delphi because they are shown in the Object Inspector. And afaik that's it?
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 15:10, Marco van de Voort wrote:

> In our previous episode, listmember said:
> [...]
>> I'd like to know this, in particular, for FPC and Lazarus --to begin with. And, the reason I'd like to know this is this: Whenever I suggest that char size be increased to 4, the idea gets opposed on the grounds that it will need huge memory --4 times as much.
>
> That's not the only reason:
> - more memory also means slower copy.

True. But, being multiples of 4 bytes, that may compensate for it. Don't quote me on this though.

> - Most OSes seem to use UTF-8 and UTF-16; with UTF-32 you would be an island, and the average text editor might not be able to read what you write.

The answer to that is this:
1) When inputting/outputting text to/from a file or the OS, you use UTF-8 (or whatever is native/required).
2) You do not make UTF-32 mandatory. But it should be there for those (and those cases) that need it.
Re: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008, Marco van de Voort wrote:

> In our previous episode, Martin Schreiber said:
>> On Sunday 23 November 2008 13.44:02 Mattias Gaertner wrote:
>>> But RTTI only contains published classes, does it not?
>>
>> AFAIK there are some more elements where it is possible to get a typeinfo pointer. A compiler specialist can say more. :-)
>
> Well, I'm not an expert, but I can only think of enumerations. These have RTTI under Delphi because they are shown in the Object Inspector. And afaik that's it?

The compiler uses RTTI to copy data structures with dynamic data types inside. I.e. records have RTTI because there might be a widestring inside, and the RTL needs it to do e.g. an assignment correctly.

Daniël
Re: [fpc-devel] Memory consumed by strings
In our previous episode, Daniël Mantione said:

>>> AFAIK there are some more elements where it is possible to get a typeinfo pointer. A compiler specialist can say more. :-)
>>
>> Well, I'm not an expert, but I can only think of enumerations. These have RTTI under Delphi because they are shown in the Object Inspector. And afaik that's it?
>
> The compiler uses RTTI to copy data structures with dynamic data types inside. I.e. records have RTTI because there might be a widestring inside, and the RTL needs it to do e.g. an assignment correctly.

Ah, didn't know the initializer/finalizer tables can be accessed/walked using typeinfo too.
Re[2]: [fpc-devel] Memory consumed by strings
Hello Daniël,

Sunday, November 23, 2008, 1:49:32 PM, you wrote:

DM> I am aware of that, but the combining cedilla is not in the easy to
DM> process range of UTF-8. In other words, you cannot do
DM> if char[i]=combining_cedille in UTF-8.
DM> Instead, in UTF-8 you need to make sure the string has enough characters
DM> left, and then compare multiple characters. Heck, you even need to take
DM> care of the fact that the combining cedilla can be encoded in 2, 3 or 4
DM> bytes.

Combined and uncombined strings are different things for different tasks. The only common point is that both have the same visual representation, but a Unicode function such as CharAt (or the like) applied to an uncombined string must never report the combined character as a result.

Some functions are designed to work on uncombined strings and others on combined ones, because some things cannot be done in one of the formats.

--
Best regards,
JoshyFun
Re[2]: [fpc-devel] Memory consumed by strings
On Sun, 23 Nov 2008, JoshyFun wrote:

> Combined and uncombined strings are different things for different tasks. The only common point is that both have the same visual representation, but a Unicode function such as CharAt (or the like) applied to an uncombined string must never report the combined character as a result.

I was not claiming that :) Instead I was saying that Edit-Search in an IDE is expected to find both.

> Some functions are designed to work on uncombined strings and others on combined ones, because some things cannot be done in one of the formats.

True.

Daniël
Re[3]: [fpc-devel] Memory consumed by strings
Hello Daniël,

Sunday, November 23, 2008, 5:21:16 PM, you wrote:

>> Combined and uncombined strings are different things for different tasks. The only common point is that both have the same visual representation, but a Unicode function such as CharAt (or the like) applied to an uncombined string must never report the combined character as a result.

DM> I was not claiming that :) Instead I was saying that Edit-Search in an
DM> IDE is expected to find both.

Yes, I know, but an Edit - Search is, for example in Lazarus, working on composed data, as it makes no sense to use decomposed data in a source editor (unless I'm wrong, of course).

--
Best regards,
JoshyFun
Re: [fpc-devel] Memory consumed by strings
On Sun, Nov 23, 2008 at 3:45 PM, listmember [EMAIL PROTECTED] wrote:

> I am referring to going to the nth character in a string. With UTF-8 it is no longer simple arithmetic and an index operation. You have to start from zero and iterate until you get to your character --at every step, calculating whether it is 2, 3 or 4 bytes long. Doing this is decompression.

Well, if the string is well-formed UTF-8, the first byte of each character will tell you how far to jump ahead, so you don't need to visit each byte. With UTF-16, you also can't just jump to the n'th character: it needs special attention to check for surrogate pairs.

At least the good thing about UTF-8 is that you don't have to worry about LE or BE byte orders. UTF-16 and UTF-32 have that nasty issue.

Regards,
- Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
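The lead-byte jump and the surrogate-pair check mentioned in this message might look like the sketch below. It is written in Python for brevity; the function names are hypothetical, and well-formed input is assumed (no validation of continuation bytes or of lone surrogates).

```python
def utf8_sequence_length(lead):
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a lead byte")

def utf8_skip(data, n):
    """Byte offset reached after skipping n code points, lead byte to lead byte."""
    offset = 0
    for _ in range(n):
        offset += utf8_sequence_length(data[offset])
    return offset

def utf16_skip(units, n):
    """Index reached after skipping n code points in a sequence of 16-bit units."""
    index = 0
    for _ in range(n):
        # A high surrogate (0xD800..0xDBFF) starts a two-unit pair.
        index += 2 if 0xD800 <= units[index] <= 0xDBFF else 1
    return index
```

Both loops still take one step per character, so reaching the n'th character remains linear in n for UTF-8 and UTF-16 alike. The jump only avoids touching the continuation bytes or the low surrogates.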
Re: [fpc-devel] Memory consumed by strings
On 2008-11-23 19:31, Graeme Geldenhuys wrote:

> At least the good thing about UTF-8 is that you don't have to worry about LE or BE byte orders. UTF-16 and UTF-32 have that nasty issue.

LE/BE only applies when streaming to/from file/device/network; otherwise life is much simpler with UTF-32.
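To make the byte-order point concrete, here is a small sketch in Python that prints the serialized bytes of one character in each form. `encodings_of` is a hypothetical helper name, and the euro sign is just a convenient sample.

```python
def encodings_of(ch):
    """Byte sequences of one character in the encodings discussed above."""
    names = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
    return {name: ch.encode(name) for name in names}

euro = encodings_of("\u20ac")

# UTF-8 has a single byte sequence; the 16- and 32-bit forms come in two
# byte orders, which only matters once the data leaves memory.
for name, raw in euro.items():
    print(name, raw.hex())
```

Inside a running program a 32-bit character is simply a native integer, which is the case the reply above has in mind. The choice between LE and BE appears only at the file, device or network boundary.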
[fpc-devel] Memory consumed by strings
Is there a way to determine how much memory is consumed by strings by a running application?

I'd like to know this, in particular, for FPC and Lazarus --to begin with.

And, the reason I'd like to know this is this: Whenever I suggest that char size be increased to 4, the idea gets opposed on the grounds that it will need huge memory --4 times as much.

There's of course some merit in that argument, but I have no idea what it is '4 times' of. This is not very engineer-like --it being unmeasured.

Can anyone suggest a way to measure the memory load caused by strings?