Hey all, I'm trying to load the Unihan database (extracted from the Unicode Character Database) into CouchDB. Parts of it involve non-ASCII (UTF-8) characters, which I escaped into the \uXXXX form the JSON specification permits.
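
For illustration, here's what that escaping looks like with Python's json module (which escapes non-ASCII to \uXXXX by default; this is just a sketch of the escaping, not how the file itself was built):

    import json

    # ensure_ascii=True (the default) turns every non-ASCII character
    # into a \uXXXX escape, so the output is pure ASCII
    record = {"kDefinition": "the original form for \u4e03 U+4E03"}
    print(json.dumps(record, ensure_ascii=True))
    # prints: {"kDefinition": "the original form for \u4e03 U+4E03"}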
Since the initial load has around 71,000 records, I'm bulk uploading via:

    curl -X POST http://localhost:5984/unihan/_bulk_docs \
         -H "Content-Type: application/json; charset=utf-8" \
         -d @data/Unihan-5.1.0.json

However, I run into this error:

    [info] [<0.62.0>] HTTP Error (code 500): {'EXIT',
        {if_clause,
            [{xmerl_ucs,char_to_utf8,1},
             {lists,flatmap,2},
             {cjson,tokenize,2},
             {cjson,decode1,2},
             {cjson,decode_object,3},
             {cjson,decode_array,3},
             {cjson,decode_object,3},
             {cjson,json_decode,2}]}}

This error occurs on a recent trunk version as well as on the 0.8.1 tarball (sorry, I don't remember the SVN rev number of the trunk version I used). I had also attempted the latest trunk (r707821), but since that did not even compile, I couldn't try it.

I don't know which record it is barfing on. Pulling a single record out:

    {
      "unihan_version": "5.1.0",
      "unihan": {
        "kIRG_GSource": "HZ",
        "kOtherNumeric": "7",
        "kIRGHanyuDaZidian": "10004.020",
        "kDefinition": "the original form for \u4e03 U+4E03",
        "kCihaiT": "10.601",
        "kPhonetic": "1635",
        "kMandarin": "QI1",
        "kCantonese": "cat1",
        "kRSKangXi": "1.1",
        "kHanYu": "10004.020",
        "kRSUnicode": "1.1",
        "kIRGKangXi": "0076.021"
      },
      "_id": "U+20001"
    }

and it uploads fine, even through the bulk uploader. I'm going to try inserting the records one by one (a rough sketch of the loader I have in mind is in the P.S. below); maybe I can pinpoint the record it is barfing on, in case its JSON is invalid. It seems to me, though, that something is choking on UTF-8 in bulk uploads over a certain size.

If someone wants to try it out, I can supply the JSON file I used. Any help is appreciated.

--
Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153
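
P.S. Here's the kind of one-by-one loader I'm planning to use. This is an untested sketch: it assumes the bulk file has the {"docs": [...]} shape that _bulk_docs expects, and PUTs each record under its _id.

    import json
    import urllib.error
    import urllib.parse
    import urllib.request

    # Assumes the bulk file is shaped {"docs": [ {...}, ... ]},
    # which is what _bulk_docs expects; adjust if yours differs.
    with open("data/Unihan-5.1.0.json", encoding="utf-8") as f:
        docs = json.load(f)["docs"]

    for doc in docs:
        # Re-serialize with ensure_ascii so non-ASCII goes out as
        # \uXXXX escapes; PUT each record individually so a failure
        # pinpoints the offending document.
        body = json.dumps(doc, ensure_ascii=True).encode("ascii")
        url = ("http://localhost:5984/unihan/"
               + urllib.parse.quote(doc["_id"], safe=""))
        req = urllib.request.Request(
            url, data=body,
            headers={"Content-Type": "application/json"},
            method="PUT")
        try:
            urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            print("failed on", doc["_id"], "->", e.code)

If one record reproducibly returns a 500, that should narrow things down.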