On Mon, Jul 7, 2008 at 2:36 AM, Andrew Farmer <[EMAIL PROTECTED]> wrote:
> On 06 Jul 08, at 23:24, Omar Qazi wrote:
>>
>> On Jul 6, 2008, at 7:46 AM, Martin Hairer wrote:
>>>
>>> This works like a treat and is faster by a factor 3 or so than using
>>> the "Moriarity" implementation. However, it leaves me a bit concerned
>>> about various warnings all over the place concerning the thread
>>> (un)safety of NSTask and NSFileHandle. So my question is: is the kind
>>> of approach that I am taking doable / reasonable? If not, is there an
>>> alternative way of doing this which is more efficient than the
>>> "Moriarity" code? Thanks a lot in advance for any help / hint,
>>
>> NSFileHandle *msgHandle = [standardInput fileHandleForReading];
>> [msgHandle waitForDataInBackgroundAndNotify];
>>
>> - (void)newMessage:(NSNotification *)notification {
>>     NSString *strOutput = [[NSString alloc] initWithData:[msgHandle
>>         availableData] encoding:NSUTF8StringEncoding];
>>     //Process the data
>>     [msgHandle waitForDataInBackgroundAndNotify];
>> }
>
> I'd be very careful with reading string input like that. It's entirely
> possible for a multi-byte character (é, for example, is represented as C3
> A9) to be split across two separate data chunks, which'll make NSString very
> confused and angry.
>
> I'm not quite sure what the correct solution is here, though. There's got to
> be some easier solution than checking for sequence completeness by
> hand...

There are really three reasonable choices:

1) Gather all of the data into an NSMutableData buffer, then create an NSString from it when the task terminates. This obviously doesn't work so well if you want to display the output while the task is still running, but it's very easy.

2) Look for an ASCII delimiter. UTF-8 is ASCII-compatible, which means that if you see a byte in the stream that looks like a particular ASCII character or character sequence, then it *is* that character or sequence. So, for example, you can keep an NSMutableData buffer, search the incoming data for the ASCII character '\n' or '\r', and break it apart at those locations.

3) Look for a clean break in the UTF-8 sequence. This is not as difficult as it sounds. There are two easy scenarios where you can break. The first is after any ASCII character: treating the bytes as unsigned, you can scan your NSMutableData buffer for any value <= 127 and break right after it. Second, you can break *before* any byte that matches this mask:

    (c & 0xC0) == 0xC0

This finds a byte whose top two bits are both 1. In UTF-8, this denotes the first byte of a multi-byte sequence (continuation bytes always start with the bits 10), so if you break right before that location, it's a safe place. The parentheses matter, since == binds more tightly than & in C.

If you want to get fancier, it's possible to read the rest of that first byte and find out how long the sequence is, then break at the end of it if you have all of it. But if you don't mind leaving one extra character in your buffer from time to time (which will get flushed out when more data arrives), then the simple check is fine.

I should note that in all of these situations you should never assume the data is always UTF-8. There's no requirement for it to be; it is at best just a convention. Be prepared for the conversion to an NSString to fail, and have some sort of reasonable fallback (flag an error, try a more permissive encoding) in that case.
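
For what it's worth, here is a rough C sketch of choice 3, including the "fancier" length check. The names utf8_seq_len and utf8_safe_prefix_len are made up for illustration, and the sketch assumes you have already copied the buffered bytes out of your NSMutableData into a plain byte array. It only finds a safe place to cut; it does not validate the data, so the conversion to NSString can still fail and needs the fallback mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the sequence introduced by lead byte c.
 * Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Returns how many leading bytes of buf can be converted right now
 * without cutting a multi-byte sequence in half. Any remaining bytes
 * should stay in the buffer until more data arrives. */
size_t utf8_safe_prefix_len(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* Walk back over at most three trailing continuation bytes. */
    size_t i = len;
    while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return len;  /* no lead byte in sight: let the converter complain */

    size_t need = utf8_seq_len(buf[i - 1]);  /* bytes the last sequence needs */
    size_t have = len - (i - 1);             /* bytes of it we actually have */

    if (need == 0 || have >= need)
        return len;  /* complete, or malformed: hand everything over */

    return i - 1;    /* cut just before the incomplete sequence */
}
```

The idea would be to call utf8_safe_prefix_len on the buffer each time new data arrives, build the string from that many bytes, and keep whatever is left over (at most three bytes) at the front of the buffer for the next round.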

If anyone is interested in learning more about how UTF-8 works and how you can parse it, the Wikipedia article is quite good:

http://en.wikipedia.org/wiki/UTF-8

It's a surprisingly simple format, and it's easy to manipulate directly if you know a little bit about masking and bit shifting in C.

Mike
_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)