On Jan 10, 2013, at 1:28 PM, mickeyf <[email protected]> wrote:
> The string itself displays as expected, but shows a length of twice the
> number of characters, as if String.Length is reporting the number of bytes
> (UTF16) rather than the number of Unicode characters in the string.
In all likelihood, the string contains non-printable characters. Consider this
`csharp` snippet:
csharp> var b = new byte[]{(byte) 'a', (byte) 'b', 0, 0, 0, 0};
csharp> var s = System.Text.Encoding.UTF8.GetString(b);
csharp> s.Length;
6
csharp> s;
"ab"
So this is more or less exactly what you're describing; `s` _clearly_ has two
characters, yet s.Length is 6!
Except `s` doesn't have two characters:
csharp> s[3];
'\x0'
There's some null data in there: our source byte array `b` contained null
bytes, and a System.String can contain ASCII NUL characters, so each null byte
was decoded into a (non-printable) character of `s`.
You can confirm/deny this by checking what `buffFromDrv` actually contains, and
seeing whether it has any non-printable data (e.g. ASCII NUL).
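One way to do that check, as a rough sketch: dump the buffer as hex and count
the NUL bytes. The `BufferCheck` class and its method names are made up for
illustration, and the sample bytes are only a stand-in for whatever your
`buffFromDrv` really holds.

```csharp
using System;
using System.Linq;

static class BufferCheck
{
    // Hypothetical helper: number of NUL (0x00) bytes in the buffer.
    public static int CountNulBytes(byte[] buffer)
    {
        return buffer.Count(b => b == 0);
    }

    // Hypothetical helper: hex dump of the buffer, e.g. "61-62-00".
    public static string ToHex(byte[] buffer)
    {
        return BitConverter.ToString(buffer);
    }

    static void Main()
    {
        // Assumption: stand-in data; substitute the real buffFromDrv here.
        var buffFromDrv = new byte[] { (byte) 'a', (byte) 'b', 0, 0, 0, 0 };
        Console.WriteLine(ToHex(buffFromDrv));
        Console.WriteLine(CountNulBytes(buffFromDrv));
    }
}
```

If the NUL count is non-zero and roughly matches the "extra" length you're
seeing, the buffer is the likely culprit rather than String.Length.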
Assuming that's the case, what you need to do is not convert "extra" data:
byte[] buffFromDrv = new byte[BIG_ENOUGH];
int bytesRead = stream.Read(buffFromDrv, readPosition, bytesToRead);
// Decode only the bytes Read() actually stored, starting where it stored them.
string s = System.Text.Encoding.UTF8.GetString(buffFromDrv, readPosition,
    bytesRead);
Or for the above `csharp` snippet:
csharp> var s = System.Text.Encoding.UTF8.GetString(b, 0, 2);
csharp> s;
"ab"
csharp> s.Length;
2
> The documentation for string.length says "number of characters", not "number
> of bytes",
It's actually neither; String.Length is the number of UTF-16 "code units"
stored in the string. This is _not_ the number of "characters" ("code points"),
because a code point may require the use of a "surrogate pair", in which case
it will take up two `char` values within the string:
http://en.wikipedia.org/wiki/UTF-16
(Normally you don't need to care about this, except when you do...)
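To make that concrete, here's a small sketch. `LengthDemo` and
`CountCodePoints` are names I'm inventing for the example, not framework API;
the only framework calls are `char.ConvertFromUtf32`, `char.IsHighSurrogate`
and `char.IsLowSurrogate`. The test string is just an "a" followed by U+1D11E
(MUSICAL SYMBOL G CLEF), which lies outside the Basic Multilingual Plane and so
needs a surrogate pair.

```csharp
using System;

static class LengthDemo
{
    // Hypothetical helper: counts code points instead of UTF-16 code units
    // by treating each well-formed surrogate pair as a single code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one code point,
            // so skip over the second half of the pair.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length
                    && char.IsLowSurrogate(s[i + 1]))
                i++;
            count++;
        }
        return count;
    }

    static void Main()
    {
        string s = "a" + char.ConvertFromUtf32(0x1D11E);
        Console.WriteLine(s.Length);            // 3 UTF-16 code units
        Console.WriteLine(CountCodePoints(s));  // 2 code points
    }
}
```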
- Jon
_______________________________________________
Mono-list maillist - [email protected]
http://lists.ximian.com/mailman/listinfo/mono-list