On Mon, Jul 7, 2008 at 2:36 AM, Andrew Farmer <[EMAIL PROTECTED]> wrote:
> On 06 Jul 08, at 23:24, Omar Qazi wrote:
>>
>> On Jul 6, 2008, at 7:46 AM, Martin Hairer wrote:
>>>
>>> This works like a treat and is faster by a factor 3 or so than using
>>> the "Moriarity" implementation. However, it leaves me a bit concerned
>>> about various warnings all over the place concerning the thread
>>> (un)safety of NSTask and NSFileHandle. So my question is: is the kind
>>> of approach that I am taking doable / reasonable? If not, is there an
>>> alternative way of doing this which is more efficient than the
>>> "Moriarity" code? Thanks a lot in advance for any help / hint,
>>
>>        NSFileHandle *msgHandle = [standardInput fileHandleForReading];
>>        [msgHandle waitForDataInBackgroundAndNotify];
>>
>> - (void)newMessage:(NSNotification *)notification {
>>        NSString *strOutput = [[NSString alloc]initWithData:[msgHandle
>> availableData] encoding:NSUTF8StringEncoding];
>>        //Process the data
>>        [msgHandle waitForDataInBackgroundAndNotify];
>> }
>
> I'd be very careful with reading string input like that. It's entirely
> possible for a multi-byte character (é, for example, is represented as C3
> A9) to be split across two separate data chunks, which'll make NSString very
> confused and angry.
>
> I'm not quite sure what the correct solution is here, though. There's got to
> be some easier solution than checking for sequence completeness by
> hand...

There are really three reasonable choices:

1) Gather all of the data into an NSMutableData buffer, then create an
NSString from it when the task terminates. This obviously doesn't work
so well if you want to display the output while the task is still
running, but it's very easy.

2) Look for an ASCII delimiter. UTF-8 is ASCII-compatible, which means
that if you see something in the stream that looks like a particular
ASCII character or character sequence, then it *is* that character or
sequence. So, for example, you can have an NSMutableData buffer, then
search the incoming data for the ASCII character '\n' or '\r' and
break it apart in those locations.

3) Look for a clean break in the UTF-8 sequence. This is not as
difficult as it sounds. There are two easy scenarios where you can
break. The first is after any ASCII character. You can scan your
NSMutableData buffer for any byte value <= 127 (read the bytes as
unsigned char), and break right after that location. Second, you can
break *before* any byte that matches this mask:

    (c & 0xC0) == 0xC0

The parentheses matter in C, since == binds tighter than &. This will
find a byte whose top two bits are both 1. In UTF-8, this denotes the
first byte of a multi-byte sequence, so you know that if you break
right before that location, it's a safe place.

If you want to get fancier, it's possible to read the rest of that
first byte in the sequence and find out how long the sequence is, then
break at the end of it if you have all of it. But if you don't mind
leaving one extra character in your buffer from time to time (which
will get flushed out when more data arrives), then this is fine.
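
As a rough sketch of choices 2 and 3 in plain C: the function names
below are made up for illustration, and the code assumes the bytes
really are UTF-8. Each function returns how many bytes at the front of
the buffer are safe to convert right now. The second one is the
"fancier" variant that reads the sequence length out of the lead byte.

```c
#include <stddef.h>

/* Choice 2: length of the prefix that ends just after the last '\n'.
   Returns 0 if the buffer holds no complete line yet. */
size_t line_safe_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;

    while (i > 0 && buf[i - 1] != '\n')
        i--;
    return i;
}

/* Choice 3: length of the longest prefix that does not end in the
   middle of a UTF-8 sequence. Malformed input is passed through whole
   so that the later string conversion can report the failure. */
size_t utf8_safe_prefix(const unsigned char *buf, size_t len)
{
    size_t back;

    /* A sequence is at most 4 bytes, so only the last 4 bytes matter. */
    for (back = 1; back <= 4 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                /* continuation byte, keep looking */

        if (c < 0x80)
            need = 1;                /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;                /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0)
            need = 3;                /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0)
            need = 4;                /* 11110xxx */
        else
            return len;              /* not a valid lead byte */

        /* 'back' bytes are available starting at this lead byte. */
        return (back < need) ? len - back : len;
    }
    return len;                      /* empty buffer or no lead byte found */
}
```

Convert the first n bytes, where n is the return value, and keep the
remaining len - n bytes at the front of the buffer for the next read.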

I should note that in all of these situations you should never assume
the data is always UTF-8. There's no requirement for it to be; it is
at best just a convention. Be prepared for the conversion to an
NSString to fail, and have some sort of reasonable fallback (flag an
error, try a more permissive encoding) in that case.
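
On the Cocoa side, -initWithData:encoding: returns nil when the bytes
are not valid in the requested encoding, so that is the case to check
for. If you would rather test the raw bytes yourself before deciding
on a fallback, here is a minimal sketch of a well-formedness check in
C. The function name is hypothetical, and which fallback to use
(NSISOLatin1StringEncoding accepts any byte, for instance) is a
judgment call for your application, not something the check decides.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;            /* decoded code point */
        size_t n, k;                 /* sequence length, loop index */

        if (c < 0x80)                { n = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; }
        else return 0;               /* stray continuation or invalid byte */

        if (n > len - i)
            return 0;                /* sequence cut off at end of data */

        for (k = 1; k < n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;            /* expected a 10xxxxxx byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates, and anything
           past U+10FFFF. */
        if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) ||
            (n == 4 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += n;
    }
    return 1;
}
```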

If anyone is interested in learning more about how UTF-8 works and how
you can parse it, the Wikipedia article is quite good:

http://en.wikipedia.org/wiki/UTF-8

It's a surprisingly simple format and it's easy to manipulate it
directly if you know a little bit about masking and bitshifting in C.
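
To give a feel for the masking and shifting involved, here is a small
sketch that goes the other way and encodes one code point as UTF-8.
The function name and signature are my own invention for
illustration, not part of any library.

```c
#include <stddef.h>

/* Write the UTF-8 encoding of code point cp into out and return the
   number of bytes written (1 to 4), or 0 if cp is not encodable. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                         /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* UTF-16 surrogates are not allowed */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode range */
}
```

Feeding it U+00E9 (é) produces the two bytes C3 A9: a lead byte with
the top two bits set, then one continuation byte of the form 10xxxxxx.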

Mike
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)