Thanks David. I’ll revisit the in-house process. If the error was introduced during whatever processing the provider does, then your sketch will be helpful.
--Matthew

From: [email protected] On Behalf Of David Lee
Sent: Thursday, March 28, 2013 5:54 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

I have a hard time imagining that the data started out as ASCII escapes ... something in the transition must have messed this up. So I suggest looking at the *process* by which your "source database" ended up in MarkLogic. It just doesn't make sense.

But ... if there is NO OTHER WAY ... you have to parse this as text and create new text.

The fn:tokenize function might be a place to start:
https://docs.marklogic.com/fn:tokenize

This allows you to split up text into a sequence of strings, like

let $strs := fn:tokenize( $element , "\\x" )

Now you will have a sequence like ("FA" , "AC" , "B5" ).
You can iterate through those and parse them as hex:
https://docs.marklogic.com/xdmp:hex-to-integer

Now you end up with binary values in a sequence ... but you STILL have to turn them from UTF8 (if that is it) into Unicode. That will require knowing how UTF8 (or whatever encoding you are dealing with) does things ... That's a pain. You don't want to go there ... but it's possible.
http://en.wikipedia.org/wiki/UTF8

Once you create a sequence of Unicode codepoints you can convert back to a string using
http://docs.marklogic.com/fn:codepoints-to-string

Maybe there is a better way ... But I would encourage you to look back at your process of HOW the data ended up in MarkLogic like this.
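
For illustration only (this is not code from the thread, and it is Python rather than XQuery), the steps David outlines could be sketched roughly as follows: split on the literal `\x`, parse each token as hex, then map the resulting bytes to Unicode. The names `decode_hex_escapes` and `ESCAPE_RUN` are hypothetical, and the sketch assumes the source encoding is already known to be cp-1251. Because cp-1251 is a single-byte code page, each byte maps to exactly one character, so Python's built-in codec can stand in for the lookup table you would otherwise have to write by hand.

```python
import re

# One or more consecutive literal escapes such as \xCF\xEE
# (a backslash, the letter "x", and two hex digits, all plain ASCII).
ESCAPE_RUN = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")


def decode_hex_escapes(text, encoding="cp1251"):
    """Replace runs of literal \\xHH escapes with the characters they encode.

    The default encoding is an assumption; pass whatever the source really used.
    """

    def decode_run(match):
        # Step 1: split the run on the literal "\x" marker.
        # The first token is always empty, so it is dropped.
        tokens = match.group(0).split("\\x")[1:]
        # Step 2: parse each two-digit token as a hexadecimal byte value.
        byte_values = bytes(int(token, 16) for token in tokens)
        # Step 3: map the bytes to Unicode using the assumed source encoding.
        return byte_values.decode(encoding)

    # Text outside the escape runs (spaces, hyphens) passes through unchanged.
    return ESCAPE_RUN.sub(decode_run, text)


# First two words of the sample title quoted later in this thread.
title = r"\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 \xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8"
print(decode_hex_escapes(title))  # prints: Повышение эффективности
```

Matching whole runs with a regular expression, rather than tokenizing the entire element on `\x`, sidesteps two snags in the sample data: the empty first token, and tokens that would otherwise carry a trailing space or hyphen.
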
It would be vastly easier to fix the process than to repair the data afterward.

"A fence on the hill or an ambulance down in the valley"
http://www.wealthandwant.com/docs/Malins_ambulance.html

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell: +1 812-630-7622
www.marklogic.com

From: [email protected] On Behalf Of Treskon, Matthew
Sent: Thursday, March 28, 2013 5:44 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

Short answer: I got the data that way from a source database. I’ll try talking with the provider but that may not be successful.

Plan B: your sketch would be much appreciated.

Thanks,
Matthew

From: [email protected] On Behalf Of David Lee
Sent: Thursday, March 28, 2013 5:38 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

First question ... to punt on your question: how did these escape chars end up in your data instead of the real data?
The BEST solution is to get the data in correctly in the first place.
Now that the data is in this form ... it's going to be painful. Possible, but painful.
You will have to do text parsing on the data, using whatever you know about the encoding, to get it into something real, then create a new document.
There are no built-in methods to parse this kind of data ... it CAN be done but it will take work ...
If you really need it done I can help sketch out a plan, but the better solution is "don't do that" ...
Can you find out how this bogus data got in there in the first place and fix it from there?

From: [email protected] On Behalf Of Treskon, Matthew
Sent: Thursday, March 28, 2013 5:31 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

David,

1) I’m using the xml:lang attribute to ‘guess’ the encoding: if xml:lang="Tr" then cp-1256.
2) I stripped the xmlns. Yes, the characters are just ASCII representations of hex.
3) I mistakenly used different elements for the Perl code and the XML element. Here is the corresponding element:

<ml:title source=" " datetime="" xmlns:ml=" ">\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 \xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8 \xF3\xEF\xF0\xE0\xE2\xEB\xE5\xED\xE8\xFF \xE2 \xEA\xEE\xEE\xEF\xE5\xF0\xE0\xF2\xE8\xE2\xED\xEE-\xE8\xED\xF2\xE5\xE3\xF0\xE0\xF6\xE8\xEE\xED\xED\xFB\xF5 \xF1\xF2\xF0\xF3\xEA\xF2\xF3\xF0\xE0\xF5 \xEF\xF3\xF2\xE5\xEC \xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF \xF1\xE8\xF1\xF2\xE5\xEC\xFB \xE1\xFE\xE4\xE6\xE5\xF2\xE8\xF0\xEE\xE2\xE0\xED\xE8\xFF</ml:title>

4) So how do I translate these ASCII literal escape sequences?

Thanks,
Matthew

From: [email protected] On Behalf Of David Lee
Sent: Thursday, March 28, 2013 5:18 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

It would help if you explain the problem a little better.
A few issues that come to mind:

1) xml:lang has nothing to do with encoding, so what is the expectation there?
2) The sample doc is not encoded in anything except ASCII (I assume the xmlns is bogus):

<ml:title source=" " datetime="" xmlns:ml=" ">\xD2\xE5\xE ...

Those are literal ASCII characters "\" "x" "D" "2" "\" "x" "E" "5" etc.
That has nothing at all to do with encoding.
3) Your Perl code is using Perl escape sequences, which have nothing to do with the data in your sample XML.
4) Encoding is a property of a file *outside* of the XML data model. If a file is imported in the wrong encoding it will be garbage and can't be re-encoded ... But if the file is like you say above, it's not that it's badly encoded ... it contains literal escape sequences as ASCII, which is an entirely different problem.

From: [email protected] On Behalf Of Treskon, Matthew
Sent: Thursday, March 28, 2013 5:12 PM
To: [email protected]
Subject: [MarkLogic Dev General] Reprocessing non-UTF8 ingested records/elements

Somewhere ‘down the line’, a few records that have been ingested into an ML database have embedded hex code from a different encoding scheme than the native UTF8 (such as cp-1251, cp-1256):

<ml:title source=" " datetime="" xmlns:ml=" ">\xD2\xE5\xED\xE4\xE5\xED\xF6\xE8\xE8 \xE8 \xE7\xE0\xEA\xEE\xED\xEE\xEC\xE5\xF0\xED\xEE\xF1\xF2\xE8 \xE2\xE5\xE4\xE5\xED\xE8\xFF \xFD\xEA\xEE\xEB\xEE\xE3\xE8\xF7\xE5\xF1\xEA\xE8 \xE1\xE5\xE7\xEE\xEF\xE0\xF1\xED\xEE\xE3\xEE \xEF\xF0\xEE\xE8\xE7\xE2\xEE\xE4\xF1\xF2\xE2\xE0 \xF1\xE5\xEB\xFC\xF1\xEA\xEE\xF5\xEE\xE7\xFF\xE9\xF1\xF2\xE2\xE5\xED\xED\xEE\xE9 \xEF\xF0\xEE\xE4\xF3\xEA\xF6\xE8\xE8</ml:title>

Once these records have been identified and the encoding scheme determined (xml:lang is present in sibling elements), how do I reprocess them (i.e. say ‘input’ is cp-1251, output utf8)?
I can see xdmp:document-load has an encoding option, but I’d hope there is a better way to handle this than export then reimport.

I’m not sure if this helps clarify. I can do this in Perl:

======
use strict;
use warnings;
require "Encode.pm";
binmode STDOUT, ":encoding(utf-8)";
my $string = "\xCF\xEE\xE2\xFB\xF8\xE5\xED\xE8\xE5 \xFD\xF4\xF4\xE5\xEA\xF2\xE8\xE2\xED\xEE\xF1\xF2\xE8 \xF3\xEF\xF0\xE0\xE2\xEB\xE5\xED\xE8\xFF \xE2 \xEA\xEE\xEE\xEF\xE5\xF0\xE0\xF2\xE8\xE2\xED\xEE-\xE8\xED\xF2\xE5\xE3\xF0\xE0\xF6\xE8\xEE\xED\xED\xFB\xF5 \xF1\xF2\xF0\xF3\xEA\xF2\xF3\xF0\xE0\xF5 \xEF\xF3\xF2\xE5\xEC \xE2\xED\xE5\xE4\xF0\xE5\xED\xE8\xFF \xF1\xE8\xF1\xF2\xE5\xEC\xFB \xE1\xFE\xE4\xE6\xE5\xF2\xE8\xF0\xEE\xE2\xE0\xED\xE8\xFF";
print Encode::decode("cp-1251",$string);

-->
Повышение эффективности управления в кооперативно-интеграционных структурах путем внедрения системы бюджетирования
<--
======

Thank you,
--Matthew Treskon
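
The Perl snippet above works because Perl itself interprets each `\x..` inside the double-quoted literal, so `Encode::decode` receives real bytes. The ingested XML instead holds the backslash, the "x", and the hex digits as ordinary characters. A small Python illustration of that difference (not code from the thread; the variable names are made up for the example, and cp-1251 is assumed as the source encoding):

```python
# What the ingested XML actually holds: backslash, "x", and two hex digits,
# all plain ASCII characters. The raw-string prefix keeps the backslashes literal.
stored = r"\xD2\xE5"

# What a double-quoted Perl string (or a Python bytes literal) holds once the
# language has interpreted the escapes: just two bytes, 0xD2 and 0xE5.
interpreted = b"\xD2\xE5"

print(len(stored), stored.isascii())  # prints: 8 True
print(len(interpreted))               # prints: 2
print(interpreted.decode("cp1251"))   # prints: Те

# Bridging the two: drop the literal "\x" markers and parse the rest as hex.
recovered = bytes.fromhex(stored.replace("\\x", ""))
print(recovered == interpreted)       # prints: True
```

Only the second form is something a decoder can work on directly, which is why the stored text has to be parsed back into bytes before any encoding conversion can happen.
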
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
