Chris Anderson wrote:
> If you don't mind, I'll take a look at it. The error you showed sure
> looks like a utf8 error, but with such a big bulk upload it's hard to
> be sure.
>
> Perhaps you can put the Unihan-5.1.0.json file online somewhere, or if
> you have it boiled down to records that are causing the problem,
> singling those out would of course be helpful.

http://windgate.isshen.net/~hhh/couchdb/Unihan-5.1.0.json.gz
http://windgate.isshen.net/~hhh/couchdb/loading.log.gz

In the meantime, I may have found what was causing the utf8 error, and
have found a different error being thrown.

I modified the extraction script so that it will do a bulk upload with a
single record. There were 9 errors of this type. When I took a look at
three of the records, it seemed pretty obvious:

{"unihan_version":"5.1.0",
 "unihan":{
   "kSemanticVariant":"U+51F9<kLau",
   "kIRG_GSource":"KX",
   "kLau":"2272",
   "kIRGHanyuDaZidian":"10099.060",
   "kDefinition":"(Cant.) \u9152\ud841\udd44, a dimple",
   "kCantonese":"nap1",
   "kRSKangXi":"13.3",
   "kCheungBauer":"013\/05;;nap1",
   "kHanYu":"10099.060",
   "kCowles":"2861",
   "kIRG_TSource":"5-2152",
   "kRSUnicode":"13.3",
   "kMeyerWempe":"1968",
   "kIRGKangXi":"0129.050",
   "kCheungBauerIndex":"341.08"},
 "_id":"U+20544" }

{"unihan_version":"5.1.0",
 "unihan":{
   "kVietnamese":"b\u1ea3u",
   "kDefinition":"(Cant.) \u751f\ud843\ude12\u4eba, a stranger",
   "kCantonese":"bou2",
   "kRSKangXi":"30.9",
   "kCheungBauer":"030\/09;;bou2",
   "kIRG_VSource":"0-3237",
   "kRSUnicode":"30.9",
   "kIRGKangXi":"0201.121",
   "kCheungBauerIndex":"365.10"},
 "_id":"U+20E12" }

{"unihan_version":"5.1.0",
 "unihan":{
   "kSemanticVariant":"U+22E23",
   "kIRG_GSource":"KX",
   "kVietnamese":"n\u00edu",
   "kIRGHanyuDaZidian":"31971.020",
   "kDefinition":"(same as U+22E23 \ud84b\ude23) to select, pick",
   "kMandarin":"NIAO3",
   "kRSKangXi":"64.13",
   "kHanYu":"31971.020",
   "kIRG_TSource":"4-5048",
   "kRSUnicode":"64.13",
   "kIRGKangXi":"0458.310"},
 "_id":"U+22D91" }

What it looks like is that it is barfing on \u9152\ud841\udd44, that is,
on the \ud8xx\udxxx escape pairs that appear in all three kDefinition
fields.

The other errors I was getting were weirder. I tried matching the error
output with the record by verifying that it made it into the database,
but there may be other records that did not report an error, yet CouchDB
returned a 404 when I tried querying them.
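
On the first error type: the escapes \ud841\udd44, \ud843\ude12 and
\ud84b\ude23 in the three records above are UTF-16 surrogate pairs, which
is how JSON \u escapes represent characters outside the Basic
Multilingual Plane. The Python sketch below is only an illustration (it
is not part of the extraction script, and the helper name
combine_surrogates is made up here). It shows which code points the pairs
should decode to, so a decoder that rejects them is likely mishandling
surrogates rather than seeing bad data.

```python
import json

def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper, written for this illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Escape values copied from the three kDefinition fields quoted above.
pairs = [(0xD841, 0xDD44), (0xD843, 0xDE12), (0xD84B, 0xDE23)]
code_points = ["U+%04X" % combine_surrogates(h, l) for h, l in pairs]
print(code_points)

# Python's json module performs the same combination while decoding:
# the result is two characters, U+9152 followed by one astral character.
decoded = json.loads('"\\u9152\\ud841\\udd44"')
print(len(decoded), [hex(ord(c)) for c in decoded])
```

The first two pairs come out as U+20544 and U+20E12, the same as the _id
of the records they appear in, and the third as U+22E23, which the
definition text itself names.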

What I'll do is write a check script and have it run through all the
records, validating that the data in the database matches the source.

Here are a few of the other errors I was getting:

{"ok":true,"new_revs":[{"id":"U+36B4","rev":"1465697479"}]}
{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n [[],\n {decoder,unicode,null,1,144,any},\n [115,101,110,111,32,102,111,32,101,102,105,119,41,\n 22994,32,115,97,32,101,109,97,115,40]]},\n {cjson,tokenize,2},\n {cjson,decode1,2},\n {cjson,decode_object,3},\n {cjson,decode_array,3},\n {cjson,decode_object,3},\n {cjson,json_decode,2},\n {couch_httpd,handle_db_request,3}]}"}
{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n [[],\n {decoder,unicode,null,1,205,any},\n [115,101,110,111,32,44,97,109,100,110,97,114,103,32,\n 59,110,101,109,111,119,32,114,111,102,32,116,99,\n 101,112,115,101,114,32,102,111,32,109,114,101,116,\n 32,97,32,59,107,108,105,109,32,59,110,97,109,111,\n 119,32,97,32,102,111,32,115,116,115,97,101,114,98,\n 32,101,104,116,32,41,23341,32,115,97,32,101,109,97,\n 115,40]]},\n {cjson,tokenize,2},\n {cjson,decode1,2},\n {cjson,decode_object,3},\n {cjson,decode_array,3},\n {cjson,decode_object,3},\n {cjson,json_decode,2},\n {couch_httpd,handle_db_request,3}]}"}
{"ok":true,"new_revs":[{"id":"U+36B9","rev":"3226496426"}]}

Records U+36B5 - U+36B8 were not loaded in.
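
The integer lists in the two function_clause errors above can be read
directly. The sketch below assumes the third argument to tokenize_string
is the list of characters consumed so far, stored in reverse order; that
is an assumption based on the usual Erlang accumulator idiom, not
something confirmed against the cjson source. The helper name
decode_reversed_acc is made up for this example.

```python
def decode_reversed_acc(code_points):
    """Turn a reversed list of Unicode code points back into text.

    Assumes the list is an accumulator built by prepending each character
    as it is read, so the last character read comes first.
    """
    return "".join(chr(cp) for cp in reversed(code_points))

# Lists copied verbatim from the two error responses above.
first_acc = [115, 101, 110, 111, 32, 102, 111, 32, 101, 102, 105, 119, 41,
             22994, 32, 115, 97, 32, 101, 109, 97, 115, 40]
second_acc = [115, 101, 110, 111, 32, 44, 97, 109, 100, 110, 97, 114, 103, 32,
              59, 110, 101, 109, 111, 119, 32, 114, 111, 102, 32, 116, 99,
              101, 112, 115, 101, 114, 32, 102, 111, 32, 109, 114, 101, 116,
              32, 97, 32, 59, 107, 108, 105, 109, 32, 59, 110, 97, 109, 111,
              119, 32, 97, 32, 102, 111, 32, 115, 116, 115, 97, 101, 114, 98,
              32, 101, 104, 116, 32, 41, 23341, 32, 115, 97, 32, 101, 109, 97,
              115, 40]

first_text = decode_reversed_acc(first_acc)
second_text = decode_reversed_acc(second_acc)
print(first_text)
print(second_text)
```

Under that assumption, 22994 and 23341 are U+59D2 and U+5B2D, so the
non-ASCII characters were read without trouble. Both decoded strings end
in "ones" with no apostrophe, and the first argument in each frame is the
empty list, so the tokenizer seems to have run out of input at the point
where the source text has "one's". That would be consistent with the
request body being cut short at the apostrophe somewhere in the upload
path (quoting, for instance), though that part is a guess.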

Weirdly enough, I think it is barfing on these two records:

{"unihan_version":"5.1.0",
 "unihan":{
   "kIRG_GSource":"KX",
   "kIRGHanyuDaZidian":"21037.080",
   "kDefinition":"(same as \u59d2)wife of one's husband's elder brother; (in ancient China) the elder of twins; a Chinese family name, (same as \u59ec) a handsome girl; a charming girl; a concubine; a Chinese family name",
   "kMandarin":"SI4",
   "kCantonese":"ci5",
   "kTotalStrokes":"8",
   "kHanYu":"21037.080",
   "kCangjie":"VRLR",
   "kIRG_TSource":"3-2843",
   "kRSUnicode":"38.5",
   "kIRGKangXi":"0258.100"},
 "_id":"U+36B6" },

{"unihan_version":"5.1.0",
 "unihan":{
   "kIRG_GSource":"KX",
   "kIRGHanyuDaZidian":"21039.040",
   "kDefinition":"(same as \u5b2d) the breasts of a woman; milk; a term of respect for women; grandma, one's elder sister or sisters, used for a girl's name",
   "kCihaiT":"383.207",
   "kMandarin":"ER3 NAI3",
   "kCantonese":"nai5",
   "kSBGY":"270.50",
   "kKPS1":"3CFA",
   "kIRG_KPSource":"KP1-3CFA",
   "kTotalStrokes":"8",
   "kHanYu":"21039.040",
   "kCangjie":"VOF",
   "kIRG_TSource":"3-2847",
   "kRSUnicode":"38.5",
   "kIRGKangXi":"0258.120"},
 "_id":"U+36B7" }

Where you have \u59d2) and \u5b2d) ... but why would that affect the
other two records (U+36B5 and U+36B8)?

As I said, I'll write a checking script and validate that all the info
is there. Since it will run for a while, I'll give it a shot after the
first utf8 error gets fixed -- who knows? The first error type might have
something to do with the second error type.

Thanks for your help.

Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153
http://www.isshen.com