David,

1)      I’m using the xml:lang attribute to ‘guess’ the encoding: for example, if
xml:lang="Tr" then cp-1256.

2)      I stripped the xmlns. Yes, the characters are just ASCII representations 
of hex bytes.

3)      I mistakenly used different titles in the Perl code and the XML 
sample. Here is the element that corresponds to the Perl string:
<ml:title source=" " datetime="" xmlns:ml=" 
">\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 
\xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8 
\xF3\xEF\xF0\xE0\xE2\xEB\xE5\xED\xE8\xFF \xE2 
\xEA\xEE\xEE\xEF\xE5\xF0\xE0\xF2\xE8\xE2\xED\xEE-\xE8\xED\xF2\xE5\xE3\xF0\xE0\xF6\xE8\xEE\xED\xED\xFB\xF5
 \xF1\xF2\xF0\xF3\xEA\xF2\xF3\xF0\xE0\xF5 \xEF\xF3\xF2\xE5\xEC 
\xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF \xF1\xE8\xF1\xF2\xE5\xEC\xFB 
\xE1\xFE\xE4\xE6\xE5\xF2\xE8\xF0\xEE\xE2\xE0\xED\xE8\xFF</ml:title>

4)      So how do I translate these ASCII literal escape sequences?
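
A minimal sketch of one way to translate such sequences outside the database, written in Python purely for illustration (the thread itself uses Perl). It assumes the element text really contains the four-character sequences backslash, "x", and two hex digits, and that the source code page is single-byte (cp1251 here). The name `decode_hex_escapes` is hypothetical and is not a MarkLogic or library function.

```python
import re

# Matches one or more consecutive literal "\xHH" sequences in the text.
_ESCAPE_RUN = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")


def decode_hex_escapes(text: str, encoding: str = "cp1251") -> str:
    """Replace each run of literal \\xHH escapes with the characters it encodes."""

    def _decode_run(match: re.Match) -> str:
        # Drop the "\x" prefixes, turn the hex digits into real bytes,
        # then decode those bytes with the assumed code page.
        hex_digits = match.group(0).replace("\\x", "")
        return bytes.fromhex(hex_digits).decode(encoding)

    return _ESCAPE_RUN.sub(_decode_run, text)


# Raw string: the backslashes are literal characters, as in the stored element.
title = r"\xEF\xF3\xF2\xE5\xEC \xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF"
print(decode_hex_escapes(title))  # prints: путем внедрения
```

Decoding run by run is only safe for single-byte code pages such as cp1251 or cp1256; a multi-byte encoding could have a character split across two runs. Text that contains no escapes passes through unchanged.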

Thanks,
Matthew



From: [email protected] On Behalf Of David Lee
Sent: Thursday, March 28, 2013 5:18 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested 
records/elements

It would help if you explained the problem a little better.
A few issues come to mind:

1) xml:lang has nothing to do with encoding, so what is the expectation there?

2) The sample doc is not encoded in anything except ASCII (I assume the xmlns 
is bogus):
<ml:title source=" " datetime="" xmlns:ml=" ">\xD2\xE5\xE ...

Those are literal ASCII characters: "\" "x" "D" "2" "\" "x" "E" "5", etc.
That has nothing at all to do with encoding.

3) Your Perl code is using Perl escape sequences, which have nothing to do with 
the data in your sample XML. Inside a double-quoted Perl string, "\xCF" is the 
single byte 0xCF, not four characters.

4) Encoding is a property of a file *outside* of the XML data model. If a 
file is imported in the wrong encoding, it will be garbage
and can't be re-encoded. But if the file is as you describe above, it's not 
badly encoded; it contains literal escape sequences as ASCII,
which is an entirely different problem.
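
The distinction in points 2 to 4 can be illustrated with a short Python sketch (Python is used only for illustration; the variable names are made up for this example). It contrasts the literal text found in the sample element with the raw bytes that a genuinely cp1251-encoded file would contain.

```python
# Raw string: eight ordinary ASCII characters, as stored in the sample element.
literal_text = r"\xD2\xE5"

# Bytes literal: two bytes, what a cp1251-encoded file would actually contain.
raw_bytes = b"\xD2\xE5"

print(len(literal_text), literal_text.isascii())  # 8 True
print(len(raw_bytes), raw_bytes.decode("cp1251"))  # 2 Те

# Because the literal text is pure ASCII, re-encoding it changes nothing:
# it produces the same bytes in UTF-8 and in cp1251.
print(literal_text.encode("utf-8") == literal_text.encode("cp1251"))  # True
```

This is why an encoding option at load time cannot help here: the stored characters are already valid ASCII, so the escapes have to be parsed back into bytes before any code page can be applied.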




-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation

From: [email protected] On Behalf Of Treskon, Matthew
Sent: Thursday, March 28, 2013 5:12 PM
To: [email protected]
Subject: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

Somewhere ‘down the line’, a few records that have been ingested into an ML 
database ended up with embedded hex escape sequences for bytes in an encoding 
other than the native UTF-8 (such as cp-1251 or cp-1256):

<ml:title source=" " datetime="" xmlns:ml=" 
">\xD2\xE5\xED\xE4\xE5\xED\xF6\xE8\xE8 \xE8 
\xE7\xE0\xEA\xEE\xED\xEE\xEC\xE5\xF0\xED\xEE\xF1\xF2\xE8 
\xE2\xE5\xE4\xE5\xED\xE8\xFF \xFD\xEA\xEE\xEB\xEE\xE3\xE8\xF7\xE5\xF1\xEA\xE8 
\xE1\xE5\xE7\xEE\xEF\xE0\xF1\xED\xEE\xE3\xEE 
\xEF\xF0\xEE\xE8\xE7\xE2\xEE\xE4\xF1\xF2\xE2\xE0 
\xF1\xE5\xEB\xFC\xF1\xEA\xEE\xF5\xEE\xE7\xFF\xE9\xF1\xF2\xE2\xE5\xED\xED\xEE\xE9
 \xEF\xF0\xEE\xE4\xF3\xEA\xF6\xE8\xE8</ml:title>

Once these records have been identified and the encoding scheme determined 
(xml:lang is present in sibling elements), how do I reprocess them (e.g. input 
is cp-1251, output is UTF-8)? I can see xdmp:document-load has an encoding option, 
but I’d hope there is a better way to handle this than exporting and reimporting.
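
As a rough sketch of the identification step described above, the Python below flags text that contains literal `\xHH` sequences and guesses a code page from an xml:lang value. The language-to-code-page table is purely an assumption for illustration (xml:lang does not define an encoding, so this can only ever be a heuristic), and `needs_reprocessing` and `guess_codec` are hypothetical names.

```python
import re
from typing import Optional

# A single literal "\xHH" sequence anywhere in the text.
HEX_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{2}")

# Assumed mapping from primary language subtag to a Windows code page.
# This table is a guess for illustration; the thread does not define one.
LANG_TO_CODEC = {
    "ru": "cp1251",  # Windows Cyrillic
    "ar": "cp1256",  # Windows Arabic
}


def needs_reprocessing(text: str) -> bool:
    """Return True if the text contains at least one literal \\xHH escape."""
    return HEX_ESCAPE.search(text) is not None


def guess_codec(xml_lang: str) -> Optional[str]:
    """Guess a code page from an xml:lang value, or None if unknown."""
    primary = xml_lang.strip().lower().split("-")[0]
    return LANG_TO_CODEC.get(primary)


print(needs_reprocessing(r"\xD2\xE5\xED"))  # True
print(guess_codec("ru-RU"))                 # cp1251
```

Records for which `guess_codec` returns None would need manual inspection rather than a default code page.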

I’m not sure if this helps clarify. I can do this in Perl:

======
use strict;
use warnings;
use Encode;    # load the Encode module; decode() is called fully qualified below

binmode STDOUT, ":encoding(utf-8)";

my $string = "\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 
\xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8 
\xF3\xEF\xF0\xE0\xE2\xEB\xE5\xED\xE8\xFF \xE2 
\xEA\xEE\xEE\xEF\xE5\xF0\xE0\xF2\xE8\xE2\xED\xEE-\xE8\xED\xF2\xE5\xE3\xF0\xE0\xF6\xE8\xEE\xED\xED\xFB\xF5
 \xF1\xF2\xF0\xF3\xEA\xF2\xF3\xF0\xE0\xF5 \xEF\xF3\xF2\xE5\xEC 
\xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF \xF1\xE8\xF1\xF2\xE5\xEC\xFB 
\xE1\xFE\xE4\xE6\xE5\xF2\xE8\xF0\xEE\xE2\xE0\xED\xE8\xFF";

print Encode::decode("cp-1251",$string);

--> Повышение эффективности управления в кооперативно-интеграционных структурах 
путем внедрения системы бюджетирования<--

======


Thank you,
--Matthew Treskon






