On Monday, 23 March 2015 at 19:25:08 UTC, Tobias Pankrath wrote:
I made the same test in C# using a 30 MB plain ASCII text file.
Compared to the fastest method proposed by Andrei, the results are
not the best:
D:
readText.representation.count!(c => c == '\n') - 428 ms
byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms
C#:
File.ReadAllLines.Length - 216 ms
Win64, D 2.066.1; optimizations were turned on in both cases.
The .NET code is clearly not performance-oriented
(http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26),
so I suspect that the .NET runtime is performing some optimizations
under the hood.
Does the C# version validate the input? Using std.file.read
instead of readText.representation halves the runtime on my
machine.
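
For reference, the two D variants being compared can be sketched roughly as below. This is only an illustration, not the benchmark code that was actually run; the function names countValidated and countRaw are made up here. The relevant difference is that readText validates the file as UTF-8 before returning it, while std.file.read returns the raw bytes untouched.

```d
import std.algorithm.searching : count;
import std.file : read, readText;
import std.string : representation;

// Hypothetical helper: readText validates the contents as UTF-8,
// then the count runs over the underlying bytes.
size_t countValidated(string path)
{
    return readText(path).representation.count!(c => c == '\n');
}

// Hypothetical helper: std.file.read returns void[] with no validation,
// so the count runs directly on the raw bytes.
size_t countRaw(string path)
{
    auto bytes = cast(const(ubyte)[]) read(path);
    return bytes.count!(c => c == '\n');
}
```

Both functions return the same number for valid input; only the amount of work done before counting differs.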
Source code is available at the link above. Since the C# version
works internally with streams and UTF-16 chars, the pseudocode
looks like this:
---
initialize a LIST with 16 items;
while (!eof)
{
    read 4096 bytes in a buffer;
    decode them to UTF-16 in a wchar[] buffer;
    while (moredata in the buffer)
    {
        read from buffer until (\n or \r\n or \r);
        discard end of line;
        if (nomorespace in LIST)
            double its size;
        add the line to LIST;
    }
}
return number of items in the LIST.
---
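
For comparison, if only the count is needed, the same line-ending rules can be applied without decoding or building a list. The following D sketch is my own illustration, not code from the .NET sources; countLines is a hypothetical name, and it assumes the whole input is already in memory as bytes. Like the pseudocode above, it treats \n, \r\n and a lone \r as terminators, and a last line without a terminator still counts.

```d
// Count lines in a byte buffer, treating "\n", "\r\n" and a lone "\r"
// as line terminators. A final line without a terminator still counts.
size_t countLines(const(ubyte)[] data)
{
    size_t lines = 0;
    size_t i = 0;
    while (i < data.length)
    {
        // Scan to the next terminator or to the end of the data.
        while (i < data.length && data[i] != '\n' && data[i] != '\r')
            ++i;
        ++lines;
        if (i == data.length)
            break;
        // Consume the terminator; "\r\n" is consumed as a single one.
        if (data[i] == '\r' && i + 1 < data.length && data[i + 1] == '\n')
            i += 2;
        else
            ++i;
    }
    return lines;
}
```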
Since this C# code is clearly not the best for this task, as I
suspected, I looked into the jitted code, and it seems that the .NET
runtime is smart enough to recognize this pattern and is doing
the following:
- file is mapped into memory using CreateFileMapping
- does not perform any decoding, since \r and \n are ASCII
- does not create any list
- searches incrementally for \r, \r\n, \n using CompareStringA
with LOCALE_INVARIANT and increments a counter at each end of line
- there is no temporary memory allocation since searching is
performed directly on the mapping handle
- returns the count.
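
A similar approach can be tried from D with std.mmfile. The sketch below is only an assumption about how one might do it, not a description of what the .NET runtime does; countMapped is a hypothetical name. It counts only '\n', as in the D measurements above, and it assumes the file is not empty, since mapping a zero-length file may fail.

```d
import std.algorithm.searching : count;
import std.mmfile : MmFile;

// Hypothetical helper: count '\n' bytes directly on a read-only
// memory mapping, without copying the file into a separate buffer.
size_t countMapped(string path)
{
    auto mmf = new MmFile(path);   // maps the whole file for reading
    scope(exit) destroy(mmf);      // unmap promptly instead of waiting for the GC
    auto bytes = cast(const(ubyte)[]) mmf[];
    return bytes.count!(c => c == '\n');
}
```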