Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

2010-10-17 Thread Ionut G. Stan

On 17/Oct/10 8:02 AM, Michael Snoyman wrote:

In the gist you sent, the problem is that you are reading the HTTP
response as a String. The HTTP library doesn't deal well with
non-Latin characters when doing String requests; you should be using
ByteString and then converting. It's a little tedious using the HTTP
library with ByteStrings, which is one of the reasons I wrote
http-enumerator. Here's some working code. The main point is to
convert the UTF-8 octets to a String.

You could also consider using one of the JSON libraries that support
bytestrings directly instead of strings, which will likely result in
much better performance. Contenders include JSONb[1] and
yajl-enumerator[2].

import Network.HTTP.Enumerator
import qualified Text.JSON as JSON
import qualified Data.ByteString.Lazy.UTF8 as BSLU

data GithubUser = GithubUser {
    name :: String,
    location :: String
  } deriving (Eq, Show)


instance JSON.JSON GithubUser where
  readJSON (JSON.JSObject object) =
    let (Just a)          = lookupM "user" $ JSON.fromJSObject object
        (JSON.JSObject b) = a
        user              = JSON.fromJSObject b
    in do name     <- lookupM "name" user >>= JSON.readJSON
          location <- lookupM "location" user >>= JSON.readJSON
          return $ GithubUser {
            name = name,
            location = location
          }

  showJSON user = JSON.makeObj [
      ("name", JSON.showJSON $ name user),
      ("location", JSON.showJSON $ location user)
    ]


lookupM :: (Monad m) => String -> [(String, a)] -> m a
lookupM x xs = maybe (fail $ "No such element: " ++ x) return (lookup x xs)

main = do jsonLbs <- simpleHttp "http://github.com/api/v2/json/user/show/igstan"
          -- Decode the raw UTF-8 octets into a String before parsing.
          let jsonText = BSLU.toString jsonLbs
          let result = JSON.decode jsonText :: JSON.Result GithubUser
          showResult result
  where showResult (JSON.Ok json) = putStrLn $ name json
        showResult (JSON.Error e) = putStrLn e

Michael

[1] http://hackage.haskell.org/package/JSONb-1.0.2
[2] http://hackage.haskell.org/package/yajl-enumerator


Thanks Michael, now it works indeed. But I don't understand, is there 
any inherent problem with Haskell's built-in String? Should one choose 
ByteString when dealing with Unicode stuff? Or, is there any resource 
that describes in one place all the problems Haskell has with Unicode?


--
Ionuț G. Stan  |  http://igstan.ro
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

2010-10-17 Thread Ionut G. Stan

On 17/Oct/10 3:37 PM, Michael Snoyman wrote:

On Sun, Oct 17, 2010 at 2:26 PM, Ionut G. Stan <ionut.g.s...@gmail.com> wrote:

Thanks Michael, now it works indeed. But I don't understand, is there any
inherent problem with Haskell's built-in String? Should one choose
ByteString when dealing with Unicode stuff? Or, is there any resource that
describes in one place all the problems Haskell has with Unicode?


There's no problem with String; you just need to remember what it
means. A String is a list of Chars, and a Char is a Unicode codepoint.
On the other hand, the HTTP protocol deals with *bytes*, not Unicode
codepoints. In order to convert between the two, you need some type of
encoding; in the case of JSON, I believe this is always specified as
UTF-8.

The problem for you is that the HTTP package does *not* perform UTF-8
decoding of the raw bytes sent over the network. Instead, I believe it
is doing the naive byte-to-codepoint conversion, aka Latin-1 decoding.
By downloading the data as bytes (i.e., a ByteString), you can then
explicitly state that you want to do UTF-8 decoding instead of
Latin-1.
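
To make the difference concrete, here is a minimal sketch of the two decodings applied to the same octets. It assumes the bytestring and text packages rather than the HTTP or utf8-string packages used in this thread, and the names rawBytes, asLatin1 and asUtf8 are made up for illustration. The sample octets are the UTF-8 encoding of "Ionuț".

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical sample: the UTF-8 octets of "Ionuț".
-- 'ț' is U+021B, which UTF-8 encodes as the two octets 0xC8 0x9B.
rawBytes :: BS.ByteString
rawBytes = BS.pack [0x49, 0x6F, 0x6E, 0x75, 0xC8, 0x9B]

-- Naive decoding: each byte becomes the code point with the same
-- number, which is exactly Latin-1 decoding.
asLatin1 :: String
asLatin1 = T.unpack (TE.decodeLatin1 rawBytes)

-- UTF-8 decoding: the two trailing octets are combined into one Char.
asUtf8 :: String
asUtf8 = T.unpack (TE.decodeUtf8 rawBytes)

main :: IO ()
main = do
  print (map fromEnum asLatin1)  -- [73,111,110,117,200,155]
  print (map fromEnum asUtf8)    -- [73,111,110,117,539]
```

Note that decodeUtf8 throws an exception on malformed input; decodeUtf8' returns an Either instead, which is probably the safer choice for data coming off the network.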

It would be entirely possible to write an HTTP library that does this
automatically, but it would be inherently limited to a single encoding
type. By dealing directly with bytestrings, you can work with any
character encoding, as well as binary data such as images which does
not have any character encoding.


OK, I think I understand now. I was under the assumption that the 
Network.HTTP package would take a look at the Content-Type header and do 
a behind-the-scenes conversion when decoding those bytes.
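
A library that did honor the header would first have to pull the charset out of the Content-Type value. Here is a rough sketch of that one step, using only base. The helpers charsetOf, splitOn and trim are hypothetical and not part of Network.HTTP, and the parsing is deliberately simplistic: it lowercases the result and ignores corner cases such as semicolons inside quoted parameter values.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- Split a string on a separator character (hypothetical helper).
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest

-- Drop leading and trailing whitespace (hypothetical helper).
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Extract the charset parameter from a Content-Type header value,
-- e.g. "application/json; charset=utf-8". The media type itself
-- (the part before the first ';') is skipped.
charsetOf :: String -> Maybe String
charsetOf header =
  listToMaybe (mapMaybe param (drop 1 (splitOn ';' header)))
  where
    param p =
      fmap (filter (/= '"'))
           (stripPrefix "charset=" (map toLower (trim p)))

main :: IO ()
main = do
  print (charsetOf "application/json; charset=utf-8")  -- Just "utf-8"
  print (charsetOf "application/json")                 -- Nothing
```

When the header carries no charset, as in the second call, the caller still has to fall back on some default or inspect the bytes themselves.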


Thanks for your help.

--
Ionuț G. Stan  |  http://igstan.ro


Re: [Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

2010-10-17 Thread Mark Lentczner
On Oct 17, 2010, at 5:37 AM, Michael Snoyman <mich...@snoyman.com> wrote:
 in the case of JSON, I believe this is always specified as UTF-8.

RFC 4627 section 3 says that JSON must be encoded in Unicode, but all encodings 
are acceptable. The encoding is inferred from the first four octets. So you 
need to be prepared to decode UTF-16 and UTF-32 and endian variants.
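
The detection works because the first two characters of a JSON text are always ASCII, so the pattern of null octets at the start gives the encoding away. A minimal sketch of that check follows, assuming the bytestring and text packages. JsonEncoding, detectEncoding and decodeJsonText are names invented here, and inputs shorter than four octets are simply treated as UTF-8, which is an assumption of this sketch rather than something the RFC spells out.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

data JsonEncoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- Null patterns from RFC 4627 section 3 (xx = non-zero octet):
--   00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
--   00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
--   xx xx xx xx  UTF-8
detectEncoding :: BS.ByteString -> JsonEncoding
detectEncoding bs =
  case map (== 0) (BS.unpack (BS.take 4 bs)) of
    [True,  True,  True,  False] -> UTF32BE
    [True,  False, True,  False] -> UTF16BE
    [False, True,  True,  True ] -> UTF32LE
    [False, True,  False, True ] -> UTF16LE
    _                            -> UTF8  -- also covers short inputs

-- Decode with whichever encoding was detected. These decoders throw
-- on malformed input.
decodeJsonText :: BS.ByteString -> T.Text
decodeJsonText bs = case detectEncoding bs of
  UTF8    -> TE.decodeUtf8 bs
  UTF16BE -> TE.decodeUtf16BE bs
  UTF16LE -> TE.decodeUtf16LE bs
  UTF32BE -> TE.decodeUtf32BE bs
  UTF32LE -> TE.decodeUtf32LE bs

main :: IO ()
main = do
  let utf16le = BS.pack [0x7B, 0x00, 0x7D, 0x00]  -- "{}" in UTF-16LE
  print (detectEncoding utf16le)                   -- UTF16LE
  putStrLn (T.unpack (decodeJsonText utf16le))     -- {}
```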

- Mark


[Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

2010-10-16 Thread Ionut G. Stan

Hi,

I'm trying to decode this JSON response: 
http://github.com/api/v2/json/user/show/igstan


As you can see, the name field contains a non-Latin character: ț, and 
it appears that Text.JSON can't decode this response when it comes from 
Network.HTTP. I've tried Network.HTTP.Enumerator too, but the problem 
persists. Here's a simple (hopefully) reproducible test case:


http://gist.github.com/630319

If you load it in ghci and call main, you'll see that it doesn't 
properly show the user name. Also, calling:


request "http://github.com/api/v2/json/user/show/igstan"

will display the respective character encoded in a way that I can't 
tell is correct or not (Unicode is not one of my strong points at the 
moment).


Can anyone shed some light on this problem? Which package is the culprit?

Thanks,
--
Ionuț G. Stan  |  http://igstan.ro