Somewhere ‘down the line’, a few records that were ingested into a MarkLogic
database ended up with embedded hex escapes for bytes from an encoding other
than the native UTF-8 (such as cp1251 or cp1256):
<ml:title source=" " datetime="" xmlns:ml="
">\xD2\xE5\xED\xE4\xE5\xED\xF6\xE8\xE8 \xE8
\xE7\xE0\xEA\xEE\xED\xEE\xEC\xE5\xF0\xED\xEE\xF1\xF2\xE8
\xE2\xE5\xE4\xE5\xED\xE8\xFF \xFD\xEA\xEE\xEB\xEE\xE3\xE8\xF7\xE5\xF1\xEA\xE8
\xE1\xE5\xE7\xEE\xEF\xE0\xF1\xED\xEE\xE3\xEE
\xEF\xF0\xEE\xE8\xE7\xE2\xEE\xE4\xF1\xF2\xE2\xE0
\xF1\xE5\xEB\xFC\xF1\xEA\xEE\xF5\xEE\xE7\xFF\xE9\xF1\xF2\xE2\xE5\xED\xED\xEE\xE9
\xEF\xF0\xEE\xE4\xF3\xEA\xF6\xE8\xE8</ml:title>
Once these records have been identified and the encoding scheme determined
(xml:lang is present in sibling elements), how do I reprocess them (i.e. treat
the input as cp1251 and output UTF-8)? I can see xdmp:document-load has an
encoding option, but I’d hope there is a better way to handle this than
exporting and then reimporting.
I’m not sure if this helps clarify, but I can do this in Perl:
======
use strict;
use warnings;
use Encode qw(decode);

# Write the decoded characters to STDOUT as UTF-8.
binmode STDOUT, ":encoding(UTF-8)";

# Raw cp1251 bytes; segments are concatenated so no newlines end up in the string.
my $string = "\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 "
    . "\xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8 "
    . "\xF3\xEF\xF0\xE0\xE2\xEB\xE5\xED\xE8\xFF \xE2 "
    . "\xEA\xEE\xEE\xEF\xE5\xF0\xE0\xF2\xE8\xE2\xED\xEE-\xE8\xED\xF2\xE5\xE3\xF0\xE0\xF6\xE8\xEE\xED\xED\xFB\xF5 "
    . "\xF1\xF2\xF0\xF3\xEA\xF2\xF3\xF0\xE0\xF5 \xEF\xF3\xF2\xE5\xEC "
    . "\xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF \xF1\xE8\xF1\xF2\xE5\xEC\xFB "
    . "\xE1\xFE\xE4\xE6\xE5\xF2\xE8\xF0\xEE\xE2\xE0\xED\xE8\xFF";

# Interpret the bytes as cp1251 and print the resulting characters.
print decode("cp1251", $string), "\n";
--> Повышение эффективности управления в кооперативно-интеграционных структурах
путем внедрения системы бюджетирования<--
======
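If the stored title holds the escapes as literal text (a backslash, an x and two hex digits) rather than as raw bytes, which is only an assumption about how the records look, the escapes would need to be turned back into bytes before decoding. A minimal Perl sketch of that step, where unescape_and_decode is a hypothetical helper name and any text outside the escapes is assumed to be plain ASCII:

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical helper: turn literal "\xNN" text into bytes, then decode
# those bytes from the given source encoding into Perl characters.
sub unescape_and_decode {
    my ($text, $encoding) = @_;
    # Replace each literal backslash-x-hex-hex with the single byte it names.
    (my $bytes = $text) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode($encoding, $bytes);
}

# Single quotes keep the backslashes literal, as assumed for the stored text.
my $stored = '\xD2\xE5\xED\xE4\xE5\xED\xF6\xE8\xE8 \xE8';
print unescape_and_decode($stored, "cp1251"), "\n";
```

The encoding name is passed in so the same helper could be driven by whatever xml:lang suggests for each record, for example cp1256 for the Arabic ones.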
Thank you,
--Matthew Treskon