On Mon, Aug 27, 2012 at 10:23 AM, Vladimir Shapovalov
<[email protected]> wrote:

> Thank you for the explanation.
>
> It looks like the serialization routine calls characters_to_binary() in
> one case (solr) but not on a normal fetch, assuming we fetch JSON in
> both cases.
>

No, Olav is incorrect in his analysis of the code.  The
unicode:characters_to_binary call is only made in
riak_search_utils:to_utf8.  This function is used to convert incoming data
only.  Erlang likes unicode data in binaries, so to_utf8 converts anything
non-binary via the unicode API.  In your case this doesn't apply since
mochijson2 has already decoded your value as a binary, so it never passes
through the characters_to_binary call.  This is not why you are seeing
escaped unicode in your JSON.

By default mochijson2 escapes unicode (see
https://github.com/mochi/mochiweb/blob/master/src/mochijson2.erl#L295).
There is an option to emit unicode as UTF-8 byte sequences instead: calling
mochijson2:encoder([{utf8, true}]) returns an encoder function.  The
problem is that this function returns an iolist, which may not be
understood correctly by whatever receives it.  E.g. an io:format
call, even with the unicode modifier, will produce incorrect results.

113> Enc = mochijson2:encoder([{utf8,true}]).
#Fun<mochijson2.0.107902433>
114> io:format("~ts~n", [Enc({struct, [{street,
unicode:characters_to_binary("Sabinastraße")}]})]).
{"street":"Sabinastraße"}
ok
115> io:format("~ts~n", [mochijson2:encode({struct, [{street,
unicode:characters_to_binary("Sabinastraße")}]})]).
{"street":"Sabinastra\u00dfe"}
ok

In fact, I think once the data is in the form of an iolist() and no longer
chardata() you are in trouble without some custom code.  I think mochijson2
would have to change to return chardata() to achieve your desired result.  I
don't think it would be too hard a patch but it is late and my tired brain is
shutting down.
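
To make the iolist() versus chardata() distinction concrete, here is a rough
sketch using only the standard library.  The IoList term is a hand-built
stand-in that I am assuming resembles what the utf8 encoder emits (integers
that are UTF-8 bytes); it is not captured mochijson2 output, and flattening
with iolist_to_binary/1 is only one possible shape for the "custom code".

```erlang
%% Hypothetical stand-in for the encoder output: an iolist whose integers
%% are UTF-8 *bytes* (195,159 is the UTF-8 encoding of code point 223).
IoList = [<<"{\"street\":\"Sabinastra">>, [195,159], <<"e\"}">>].

%% Read as chardata, the integers are taken as code points, so 195 and 159
%% stay two separate characters.
Misread = unicode:characters_to_list(IoList, utf8).
true = lists:member(195, Misread).

%% Flattened to a binary first, the same bytes are decoded as UTF-8 and
%% collapse into the single code point 223.
Bin = iolist_to_binary(IoList).
Decoded = unicode:characters_to_list(Bin, utf8).
true = lists:member(223, Decoded).
false = lists:member(195, Decoded).
```

If that assumption about the encoder output holds, passing Bin rather than
the raw iolist to io:format("~ts~n", [Bin]) should avoid the misreading.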

Sorry if this was more technical an answer than you wanted.  I wanted the
reasons to be clear rather than leave you with a shallow "no."


>
> On Sun, Aug 26, 2012 at 6:07 PM, Olav Frengstad <[email protected]> wrote:
>
>> The current implementation uses mochijson2 to serialize JSON
>> (src/riak_solr_output.erl). By itself mochijson2 produces the output you
>> expect:
>>
>> 55> io:format("~s~n", [mochijson2:encode("Sabinastraße")]).
>> [83,97,98,105,110,97,115,116,114,97,223,101]
>>
>
No, what you passed here is not an object but an array: a plain Erlang
string is a list of integers, and mochijson2 encodes a list as a JSON array.
It should be {struct, [{street, unicode:characters_to_binary("Sabinastraße")}]}.  The
value must be a binary, and you can't simply use <<"Sabinastraße">> or
list_to_binary("Sabinastraße") because either produces bad UTF-8 and mochijson2
will tell you so.
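
A small sketch of why the two conversions differ, using only the standard
library.  The code points are written out explicitly (they are the same
integers shown in the quoted output above) so the result does not depend on
how your shell or source file is encoded; the variable names are just for
illustration.

```erlang
%% "Sabinastraße" as a list of code points; 223 is ß.
Chars = [83,97,98,105,110,97,115,116,114,97,223,101].

%% list_to_binary/1 writes each integer as one byte, so ß becomes the single
%% byte 223.  That is Latin-1, not a valid UTF-8 sequence.
Latin1 = list_to_binary(Chars).
12 = byte_size(Latin1).

%% unicode:characters_to_binary/1 encodes to UTF-8, so ß becomes two bytes.
Utf8 = unicode:characters_to_binary(Chars).
13 = byte_size(Utf8).

%% Decoding the Latin-1 bytes as UTF-8 fails at the ß byte ...
{error, _, _} = unicode:characters_to_list(Latin1, utf8).

%% ... while the UTF-8 binary round-trips back to the original code points.
Chars = unicode:characters_to_list(Utf8, utf8).
```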

-Z
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com