Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP
On 17/Oct/10 8:02 AM, Michael Snoyman wrote:
> In the gist you sent, the problem is that you are reading the HTTP
> response as a String. The HTTP library doesn't deal well with non-Latin
> characters when doing String requests; you should be using ByteString
> and then converting. It's a little tedious using the HTTP library with
> ByteStrings, which is one of the reasons I wrote http-enumerator.
>
> Here's some working code. The main point is to convert the UTF8 octets
> to a String. You could also consider using one of the JSON libraries
> that support bytestrings directly instead of strings, which will likely
> result in much better performance. Contenders include JSONb[1] and
> yajl-enumerator[2].
>
> import Network.HTTP.Enumerator
> import qualified Text.JSON as JSON
> import qualified Data.ByteString.Lazy.UTF8 as BSLU
>
> data GithubUser = GithubUser
>     { name :: String
>     , location :: String
>     } deriving (Eq, Show)
>
> instance JSON.JSON GithubUser where
>     readJSON (JSON.JSObject object) =
>         let (Just a) = lookupM "user" $ JSON.fromJSObject object
>             (JSON.JSObject b) = a
>             user = JSON.fromJSObject b
>         in do name <- lookupM "name" user >>= JSON.readJSON
>               location <- lookupM "location" user >>= JSON.readJSON
>               return $ GithubUser { name = name, location = location }
>
>     showJSON user = JSON.makeObj
>         [ ("name", JSON.showJSON $ name user)
>         , ("location", JSON.showJSON $ location user)
>         ]
>
> lookupM :: (Monad m) => String -> [(String, a)] -> m a
> lookupM x xs = maybe (fail $ "No such element: " ++ x) return (lookup x xs)
>
> main = do
>     jsonLbs <- simpleHttp "http://github.com/api/v2/json/user/show/igstan"
>     let jsonText = BSLU.toString jsonLbs
>     let result = JSON.decode jsonText :: JSON.Result GithubUser
>     showResult result
>   where
>     showResult (JSON.Ok json) = putStrLn $ name json
>     showResult (JSON.Error e) = putStrLn e
>
> Michael
>
> [1] http://hackage.haskell.org/package/JSONb-1.0.2
> [2] http://hackage.haskell.org/package/yajl-enumerator

Thanks Michael, now it works indeed. But I don't understand, is there any inherent problem with Haskell's built-in String? Should one choose ByteString when dealing with Unicode stuff? Or, is there any resource that describes in one place all the problems Haskell has with Unicode?

--
Ionuț G. Stan | http://igstan.ro
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP
On 17/Oct/10 3:37 PM, Michael Snoyman wrote:
> On Sun, Oct 17, 2010 at 2:26 PM, Ionut G. Stan <ionut.g.s...@gmail.com> wrote:
>> Thanks Michael, now it works indeed. But I don't understand, is there
>> any inherent problem with Haskell's built-in String? Should one choose
>> ByteString when dealing with Unicode stuff? Or, is there any resource
>> that describes in one place all the problems Haskell has with Unicode?
>
> There's no problem with String; you just need to remember what it
> means. A String is a list of Chars, and a Char is a Unicode codepoint.
> On the other hand, the HTTP protocol deals with *bytes*, not Unicode
> codepoints. In order to convert between the two, you need some type of
> encoding; in the case of JSON, I believe this is always specified as
> UTF-8.
>
> The problem for you is that the HTTP package does *not* perform UTF-8
> decoding of the raw bytes sent over the network. Instead, I believe it
> is doing the naive byte-to-codepoint conversion, aka Latin-1 decoding.
> By downloading the data as bytes (ie, a ByteString), you can then
> explicitly state that you want to do UTF-8 decoding instead of Latin-1.
>
> It would be entirely possible to write an HTTP library that does this
> automatically, but it would be inherently limited to a single encoding
> type. By dealing directly with bytestrings, you can work with any
> character encoding, as well as binary data such as images which does
> not have any character encoding.

OK, I think I understand now. I was under the assumption that the Network.HTTP package will take a look at the Content-Type header and do a behind-the-scene conversion before decoding those bytes.

Thanks for your help.

--
Ionuț G. Stan | http://igstan.ro
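To make the Latin-1 versus UTF-8 distinction concrete, here is a minimal sketch. It decodes with `Data.Text.Encoding.decodeUtf8` from the `text` package rather than the `utf8-string` module used in the thread's code, a substitution made only for illustration. The names `rawName`, `naiveDecode`, and `utf8Decode` are hypothetical, and the bytes assume the name is spelled with ț as U+021B.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The octets a server would send for "Ionuț" in UTF-8.
-- Assumption: ț is U+021B, whose UTF-8 encoding is 0xC8 0x9B.
rawName :: B.ByteString
rawName = B.pack [0x49, 0x6F, 0x6E, 0x75, 0xC8, 0x9B]

-- Naive byte-to-codepoint conversion (Latin-1 decoding):
-- every octet becomes one Char with the same numeric value.
naiveDecode :: B.ByteString -> String
naiveDecode = BC.unpack

-- Explicit UTF-8 decoding: multi-octet sequences are combined
-- into a single code point. Throws on malformed input.
utf8Decode :: B.ByteString -> String
utf8Decode = T.unpack . TE.decodeUtf8
```

In ghci, `naiveDecode rawName` gives a six-character String whose last two Chars are code points 200 and 155, while `utf8Decode rawName` gives five characters ending in U+021B.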
Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP
On Oct 17, 2010, at 5:37 AM, Michael Snoyman <mich...@snoyman.com> wrote:
> in the case of JSON, I believe this is always specified as UTF-8.

RFC 4627 section 3 says that JSON must be encoded in Unicode, but all Unicode encodings are acceptable. The encoding is inferred from the first four octets. So you need to be prepared to decode UTF-16 and UTF-32 and their endian variants.

- Mark
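The detection rule from RFC 4627 section 3 can be sketched as follows. Because a JSON text in that RFC always starts with two ASCII characters, the pattern of zero octets among the first four identifies the encoding. This is a rough sketch, not a complete implementation: `JsonEncoding`, `detectJsonEncoding`, and `decodeJsonText` are hypothetical names, the decoders come from the `text` package, and no byte order mark handling is attempted.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

data JsonEncoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- Inspect which of the first four octets are zero.
--   00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
--   00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
--   anything else (including inputs shorter than four octets): UTF-8
detectJsonEncoding :: B.ByteString -> JsonEncoding
detectJsonEncoding bs =
  case map (== 0) (B.unpack (B.take 4 bs)) of
    [True,  True,  True,  False] -> UTF32BE
    [True,  False, True,  False] -> UTF16BE
    [False, True,  True,  True ] -> UTF32LE
    [False, True,  False, True ] -> UTF16LE
    _                            -> UTF8

-- Decode the raw octets with whichever decoder the detection selects.
-- These decoders throw an exception on malformed input.
decodeJsonText :: B.ByteString -> T.Text
decodeJsonText bs =
  case detectJsonEncoding bs of
    UTF8    -> TE.decodeUtf8 bs
    UTF16BE -> TE.decodeUtf16BE bs
    UTF16LE -> TE.decodeUtf16LE bs
    UTF32BE -> TE.decodeUtf32BE bs
    UTF32LE -> TE.decodeUtf32LE bs
```

The resulting `Text` can then be turned into a `String` with `T.unpack` before being handed to a String-based parser such as `Text.JSON.decode`.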
[Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP
Hi,

I'm trying to decode this JSON response:

http://github.com/api/v2/json/user/show/igstan

As you can see, the name field contains a non-Latin character: ț, and it appears that Text.JSON can't decode this response when it comes from Network.HTTP. I've tried Network.HTTP.Enumerator too, but the problem persists.

Here's a simple (hopefully) reproducible test case:

http://gist.github.com/630319

If you load it in ghci and call main, you'll see that it doesn't properly show the user name. Also, calling:

request "http://github.com/api/v2/json/user/show/igstan"

will display the respective character encoded in a way that I have no idea whether or not it is correct (Unicode is not one of my strong points for the moment).

Can anyone shed some light on this problem? Which package is the culprit?

Thanks,

--
Ionuț G. Stan | http://igstan.ro
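One way to check whether a String that displays oddly really contains UTF-8 octets read one byte per Char is to turn it back into bytes and attempt a strict UTF-8 decode. The sketch below is only an illustration under that assumption: `reinterpretAsUtf8` is a hypothetical helper, and it relies on `decodeUtf8'` from the `text` package rather than anything used in the gist.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Treat each Char as a single octet and try to decode the result as UTF-8.
-- Returns Nothing when a Char does not fit in one octet (so the String
-- cannot be raw bytes) or when the octets are not valid UTF-8.
reinterpretAsUtf8 :: String -> Maybe String
reinterpretAsUtf8 s
  | any (> '\xFF') s = Nothing
  | otherwise =
      case TE.decodeUtf8' (BC.pack s) of
        Left _  -> Nothing
        Right t -> Just (T.unpack t)
```

For example, `reinterpretAsUtf8 "Ionu\200\155"` succeeds and yields the five-character name ending in U+021B, which suggests the String held undecoded UTF-8 octets, whereas a lone `"\233"` yields `Nothing` because that octet is not valid UTF-8 on its own.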