I will reply with a shorter version of what came up recently on another list. ML uses UTF-8 internally for xs:string, which is what xdmp:quote() returns, but that's misleading. It's irrelevant, because the encoding of a string is not exposed at the XQuery layer. We could be using anything and the behavior would be the same. The only time encoding comes into play is when serializing to or deserializing from *bytes* (i.e. to/from a file or text document).
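To make the string-versus-bytes point concrete, here is a minimal sketch in Python rather than XQuery, since the point is not specific to MarkLogic. The sample text and the `md5_hex` helper are made up for illustration. It shows that one and the same string produces different hashes depending on which encoding is chosen at the moment it is turned into bytes.

```python
import hashlib

# A string is a sequence of characters; no encoding is attached to it yet.
text = "<name>José</name>"

# Encoding only enters the picture when the string is turned into bytes.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")


def md5_hex(data: bytes) -> str:
    """Hypothetical helper: hex MD5 digest of a byte sequence."""
    return hashlib.md5(data).hexdigest()


# Decoding each byte sequence with its own encoding gives back the same text...
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("iso-8859-1")
# ...but the bytes differ, so the hashes differ.
same_hash = md5_hex(utf8_bytes) == md5_hex(latin1_bytes)

print(same_text, same_hash)
```

So if two systems are going to compare hashes, they first have to agree on the exact bytes being hashed, and that includes the output encoding.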
Unless you are using text or binary documents, generating a hash on them is pointless: you are not going to get the same hash as the original file. This is true for any processor of XML, JSON, or other structured documents that does any parsing or serializing. Documents stored in the database are not byte-equal to the source document.

This is true at multiple levels. At the "store on the disk" level, a "document" doesn't resemble the source document at all, any more than a CSV file resembles the block structure of an Oracle partition, let alone being byte-equal to it. For some primitive document types, namely binary, what you get back should be exactly equal to what you put in. But even text documents might undergo character set translation or Unicode normalization, so what you get back may not be byte-equal.

Structured docs are more complex. XML and JSON first undergo charset translation like text, then they are parsed into an internal node structure, and then stored in a very concise format.

Even ignoring the disk format and MarkLogic, ALL XML and JSON processors share this issue. The serialized text form of a document is not the same thing as its value. You can rarely, if ever, do a round trip on JSON or XML and get back byte for byte what you started with. It's *critical* to understand that the byte/text format of documents is the transport layer, not the document model itself. And with that concept, there are many equivalent ways to express the same document model. A very simple example: in XML, attribute order is not significant (nor is the order of object members in JSON), and whitespace between attributes is ignored, so <foo a="b" c="d"/> and <foo c="d" a="b"/> are the same document.

To wrap this up: calculating a hash of a document before storing it and after retrieving it doesn't give you the answer you want. The hashes will almost certainly be different, whether or not the document "is the same thing".
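As a rough illustration of the attribute-order example, here is a sketch in Python (standard library only, Python 3.8 or later for `ET.canonicalize`). The sample documents and the `md5_hex` helper are invented for the example. Canonical XML is one possible definition of "equal", not necessarily the one you want, and the JSON normalization shown is an ad hoc convention, not a formal canonicalization standard.

```python
import hashlib
import json
import xml.etree.ElementTree as ET


def md5_hex(text: str) -> str:
    """Hypothetical helper: hex MD5 of a string, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two serializations of the same XML document model.
xml_a = '<foo a="b" c="d"/>'
xml_b = '<foo   c="d"  a="b"></foo>'

# Hashing the raw text compares bytes, not documents.
raw_xml_equal = md5_hex(xml_a) == md5_hex(xml_b)

# Hashing a canonical serialization compares the document model instead.
canon_xml_equal = md5_hex(ET.canonicalize(xml_a)) == md5_hex(ET.canonicalize(xml_b))

# The same idea for JSON: parse, then re-serialize with a fixed convention
# (sorted keys, no optional whitespace). This convention is an assumption.
json_a = '{"a": 1, "b": [1, 2]}'
json_b = '{ "b":[1,2], "a":1 }'


def normalize_json(text: str) -> str:
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


raw_json_equal = md5_hex(json_a) == md5_hex(json_b)
norm_json_equal = md5_hex(normalize_json(json_a)) == md5_hex(normalize_json(json_b))

print(raw_xml_equal, canon_xml_equal, raw_json_equal, norm_json_equal)
```

For a hash comparison between two systems to mean anything, both sides would have to apply the same normalization and the same output encoding before hashing.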
To provide better advice on how to compare for document equality, I need more specifics, such as what format the document is in and how you want to define equality.

From: [email protected] [mailto:[email protected]] On Behalf Of Sujith
Sent: Thursday, July 17, 2014 11:20 AM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] MD5 - Hash Question

What is the default encoding that ML uses for xdmp:quote()?

There is a daily job that loads Hadoop (Cloudera distribution) with the files that we have in ML, using mlcp. Now we want to check whether both of them are in sync, so we are using an MD5 hash for validation. Initially we provided Hadoop with our hash and they came back saying that it didn't match their data. After doing some analysis we figured out that we should explicitly specify the encoding as UTF-8 in the xdmp:quote options, as the Java program on their end is doing the same.

(: The hash that did not match :)
xquery version "1.0-ml";
xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))

In other words, what would be the default encoding xdmp:quote uses? (My assumption is that by default ML saves data as UTF-8 if no encoding is specified, and that the same would be used when it retrieves the documents.)

(: The hash that matches :)
xquery version "1.0-ml";
xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),
  <options xmlns="xdmp:quote">
    <output-encoding>utf-8</output-encoding>
    <omit-xml-declaration>yes</omit-xml-declaration>
  </options>))

Any insight is very much appreciated.

--
Thanks & Regards
SujithMaram
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
