On 4 Aug 2011, at 6:49 AM, Andreas Grosam wrote:

> I want to create a CFString using function CFStringCreateWithBytes.
> 
> CFStringRef CFStringCreateWithBytes (
>   CFAllocatorRef alloc,
>   const UInt8 *bytes,
>   CFIndex numBytes,
>   CFStringEncoding encoding,
>   Boolean isExternalRepresentation
> );
> 
> I suspect the "encoding" parameter refers to the encoding of the source 
> string.

The thing to bear in mind is that it is the encoding of the _source_ string. 
It's a fact about the bytes you're importing. Facts about the data aren't 
changeable at runtime, so there isn't a choice you can make when you call 
CFStringCreateWithBytes. 
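
To make that concrete, here is an untested sketch of my own (not from the
docs) of importing BOM-less UTF-16LE bytes, declaring exactly what they are:

    #include <CoreFoundation/CoreFoundation.h>

    int main(void) {
        /* "Hi" as UTF-16LE, no BOM. The encoding is a fact about
           these bytes, so that's what we declare. */
        const UInt8 bytes[] = { 0x48, 0x00, 0x69, 0x00 };

        CFStringRef s = CFStringCreateWithBytes(kCFAllocatorDefault,
                                                bytes, sizeof(bytes),
                                                kCFStringEncodingUTF16LE,
                                                false); /* input has no BOM */
        if (s) {
            CFShow(s);    /* prints "Hi" */
            CFRelease(s);
        }
        return 0;
    }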

> My source buffer containing the string can be encoded in UTF-16LE or UTF-16BE.
> I don't want to have a BOM in the resulting CFString - and the source buffer 
> does not contain it either.

CFString is an opaque type. You don't know how it stores its characters 
internally, and you shouldn't have to care. It might store endianness as a BOM 
in a character buffer, or as a flag in an associated data structure, or it 
might have a preferred internal endianness that you never see from the outside. 
It may or may not store the characters as UTF-16 (either endianness) at all. 
These details may vary by architecture, version of Core Foundation, and even 
from string to string.

"[T]he need for a BOM arises in the context of text interchange, rather than in 
normal text processing within a closed environment." — Wikipedia, "Byte order 
mark," <http://en.wikipedia.org/wiki/Byte_order_mark>

> The documentation does not tell me which source encoding is preferred for 
> initializing the CFString most efficiently. I would guess this is UTF-16LE 
> on Intel machines.

If you mean that you have control over how the bytes in the source data were 
originally written, little-endian may be a good choice, but it's only a guess, 
and guesses about the "efficiency" of opaque functions are worthless. If Core 
Foundation doesn't always use UTF-16 internally, there may be a conversion 
anyway, and the efficiency of the source is at most a minor consideration.

If I were less lazy, I'd look at the source of CFLite and know for sure. The 
best way to know, however, is not to guess. Prepare your source text in both 
orders, and benchmark CFStringCreateWithBytes each way. That way, you can get 
the answer that matches your actual use. You may find that byte order makes so 
little difference in speed that it needn't be a consideration.
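
Something along these lines, untested, with the buffer size and repetition
count made up; substitute your real data:

    #include <CoreFoundation/CoreFoundation.h>
    #include <stdio.h>

    /* Time CFStringCreateWithBytes for one byte order. */
    static double timeIt(const UInt8 *buf, CFIndex len,
                         CFStringEncoding enc, int reps) {
        CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
        for (int i = 0; i < reps; i++) {
            CFStringRef s = CFStringCreateWithBytes(kCFAllocatorDefault,
                                                    buf, len, enc, false);
            if (s) CFRelease(s);
        }
        return CFAbsoluteTimeGetCurrent() - start;
    }

    int main(void) {
        enum { N = 4096, REPS = 10000 };
        static UInt8 be[N * 2], le[N * 2];

        /* Same text in both orders, so only the import path differs. */
        for (int i = 0; i < N; i++) {
            UniChar c = 'A' + (i % 26);
            be[2*i]     = (UInt8)(c >> 8);
            be[2*i + 1] = (UInt8)(c & 0xFF);
            le[2*i]     = (UInt8)(c & 0xFF);
            le[2*i + 1] = (UInt8)(c >> 8);
        }

        printf("BE: %f s\n",
               timeIt(be, sizeof(be), kCFStringEncodingUTF16BE, REPS));
        printf("LE: %f s\n",
               timeIt(le, sizeof(le), kCFStringEncodingUTF16LE, REPS));
        return 0;
    }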

Wikipedia reports that the Unicode standard says a UTF-16 stream with no BOM 
is to be read as big-endian. So if your first priority is to avoid a BOM, 
your choice is made for you: pass kCFStringEncodingUTF16BE. Correctness is 
a much bigger consideration than the presence of two bytes. One of my slogans 
is that it's a false economy to get the wrong answer as quickly as possible.

However, assuming big-endian means trusting every writer of your source 
stream absolutely. If you accept a BOM, you'll be able to handle more inputs. 
Otherwise, try big-endian, and if CFStringCreateWithBytes returns NULL, try 
again with little-endian, as in the sketch below.
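
As an untested sketch (the helper name is mine), that fallback might read:

    #include <CoreFoundation/CoreFoundation.h>

    /* Hypothetical helper: assume big-endian per the standard, fall
       back to little-endian only if the BE reading is rejected.
       Caveat: byte-swapped UTF-16 is often still structurally valid,
       so NULL catches only outright invalid sequences. */
    static CFStringRef CreateStringFromBOMlessUTF16(const UInt8 *bytes,
                                                    CFIndex len) {
        CFStringRef s = CFStringCreateWithBytes(kCFAllocatorDefault,
                                                bytes, len,
                                                kCFStringEncodingUTF16BE,
                                                false);
        if (s == NULL) {
            s = CFStringCreateWithBytes(kCFAllocatorDefault,
                                        bytes, len,
                                        kCFStringEncodingUTF16LE,
                                        false);
        }
        return s; /* caller releases; NULL if neither reading worked */
    }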

> And what happens if I just specify kCFStringEncodingUTF16? Is the source 
> encoding then assumed to be host endianness, or UTF-16BE as the Unicode 
> Standard suggests?

Possibly CFStringCreateWithBytes tries it both ways, and accepts the way that 
doesn't error. Maybe, to favor the standard behavior, it tries big-endian 
first. I haven't looked at the source, and can't tell you for sure. The thing 
to do is _test_, with the kind of data you'll actually use, and you'll know.
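
One such untested test, feeding BOM-less buffers of known order through
kCFStringEncodingUTF16 and seeing which round-trips (the printed conclusions
are mine):

    #include <CoreFoundation/CoreFoundation.h>
    #include <stdio.h>

    int main(void) {
        const UInt8 be[] = { 0x00, 'H', 0x00, 'i' };  /* BE "Hi", no BOM */
        const UInt8 le[] = { 'H', 0x00, 'i', 0x00 };  /* LE "Hi", no BOM */

        CFStringRef fromBE = CFStringCreateWithBytes(kCFAllocatorDefault,
            be, sizeof(be), kCFStringEncodingUTF16, false);
        CFStringRef fromLE = CFStringCreateWithBytes(kCFAllocatorDefault,
            le, sizeof(le), kCFStringEncodingUTF16, false);

        if (fromBE && CFStringCompare(fromBE, CFSTR("Hi"), 0)
                          == kCFCompareEqualTo)
            printf("BOM-less kCFStringEncodingUTF16 read as big-endian\n");
        if (fromLE && CFStringCompare(fromLE, CFSTR("Hi"), 0)
                          == kCFCompareEqualTo)
            printf("BOM-less kCFStringEncodingUTF16 read as little-endian\n");

        if (fromBE) CFRelease(fromBE);
        if (fromLE) CFRelease(fromLE);
        return 0;
    }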

        — F
