On Mon, Aug 27, 2012 at 10:23 AM, Vladimir Shapovalov <[email protected]> wrote:
> Thank you for the explanation.
>
> It looks like the serialization routine in one case (solr) encodes
> characters_to_binary()
> and by normal fetch not. Assuming we fetch in JSON in both cases.
>

No, Olav is incorrect in his analysis of the code. The
unicode:characters_to_binary call is only made in
riak_search_utils:to_utf8. This function is used to convert incoming
data only. Erlang likes unicode data in binaries; to_utf8 converts
anything non-binary via the unicode API. In your case this doesn't
apply, since mochijson2 is already decoding your value as a binary. It
won't even pass through the characters_to_binary call.

This is not why you are seeing escaped unicode in your JSON. By default
mochijson2 escapes unicode (see
https://github.com/mochi/mochiweb/blob/master/src/mochijson2.erl#L295).
There is an option to emit unicode as UTF-8 byte sequences. You do this
by calling mochijson2:encoder([{utf8, true}]), which returns a function.
The problem is that this function returns an iolist, which may not be
understood correctly by whatever receives it. E.g. an io:format call,
even with the unicode modifier, will produce incorrect results.

113> Enc = mochijson2:encoder([{utf8,true}]).
#Fun<mochijson2.0.107902433>
114> io:format("~ts~n", [Enc({struct, [{street, unicode:characters_to_binary("Sabinastraße")}]})]).
{"street":"SabinastraÃe"}
ok
115> io:format("~ts~n", [mochijson2:encode({struct, [{street, unicode:characters_to_binary("Sabinastraße")}]})]).
{"street":"Sabinastra\u00dfe"}
ok

In fact, I think once the data is in the form of an iolist() and no
longer chardata(), you are in trouble without some custom code. I think
mochijson2 would have to change to return chardata() to achieve your
desired result. I don't think it would be too hard a patch, but it is
late and my tired brain is shutting down.

Sorry if this was a more technical answer than you wanted. I wanted the
reasons to be clear rather than leave you with a shallow "no."

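To make the iolist() versus chardata() point concrete, here is a minimal
sketch that needs no mochijson2. The module name iolist_vs_chardata and
the functions utf8_bytes/0 and demo/0 are made up for illustration, and
the hand-built list only stands in for what the {utf8, true} encoder
returns; it is not copied from mochijson2. It assumes the integers in
the list are UTF-8 bytes rather than code points, which is what the
garbled output in the session above suggests.

```erlang
-module(iolist_vs_chardata).
-export([utf8_bytes/0, demo/0]).

%% Hypothetical stand-in for the encoder output: an iolist whose
%% integers are UTF-8 bytes, not Unicode code points.
utf8_bytes() ->
    %% 16#DF is U+00DF (the sharp s), written as a code point so the
    %% example does not depend on the source file encoding.
    Street = unicode:characters_to_binary("Sabinastra" ++ [16#DF] ++ "e"),
    ["{\"street\":\"", binary_to_list(Street), "\"}"].

demo() ->
    IoList = utf8_bytes(),
    %% Passed as a list, ~ts reads every integer as a code point, so
    %% the bytes 195 and 159 are printed as two separate characters.
    io:format("~ts~n", [IoList]),
    %% Flattened to a binary first, ~ts decodes the bytes as UTF-8.
    io:format("~ts~n", [iolist_to_binary(IoList)]).
```

If the consumer takes raw bytes (a socket, for instance), the iolist
should be usable as it is; the flattening step only matters for callers
like io:format that interpret list elements as characters.
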
> On Sun, Aug 26, 2012 at 6:07 PM, Olav Frengstad <[email protected]> wrote:
>
>> The current implementation uses mochiweb2 to serialize JSON. By itself
>> mochijson2 produces the output you expect (src/riak_solr_output.erl):
>>
>> 55> io:format("~s~n", [mochijson2:encode("Sabinastraße")]).
>> [83,97,98,105,110,97,115,116,114,97,223,101]
>>

No, what you passed here is not an object but an array. It should be
{struct, [{street, unicode:characters_to_binary("Sabinastraße")}]}. The
value must be a binary, and you can't simply do <<"Sabinastraße">> or
list_to_binary("Sabinastraße") because it produces bad UTF-8 and
mochijson2 will tell you so.

-Z
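
For the "bad UTF-8" remark, a small sketch of the difference between the
two ways of building the binary. The module utf8_check and its functions
are hypothetical names for illustration only, and the validity check
uses the stdlib unicode module rather than whatever check mochijson2
performs internally.

```erlang
-module(utf8_check).
-export([is_valid_utf8/1, latin1_street/0, utf8_street/0]).

%% "Sabinastraße" as code points; 16#DF is U+00DF (the sharp s).
street() -> "Sabinastra" ++ [16#DF] ++ "e".

%% One byte per character: 16#DF becomes the single byte 223.
latin1_street() -> list_to_binary(street()).

%% UTF-8 encoded: 16#DF becomes the two bytes 195, 159.
utf8_street() -> unicode:characters_to_binary(street()).

%% True only if Bin decodes cleanly as UTF-8.
is_valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        L when is_list(L) -> true;
        _ -> false  % an {error, ...} or {incomplete, ...} tuple
    end.
```

Calling utf8_check:is_valid_utf8/1 on the two binaries shows that only
the one built with unicode:characters_to_binary passes.
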
