Re: [MarkLogic Dev General] MD5 - Hash Question

David Lee Thu, 17 Jul 2014 10:56:39 -0700

I will reply with a shorter version of what came up recently on another list.
ML uses UTF-8 internally for xs:string, which is what xdmp:quote() returns,
but that’s misleading. It’s irrelevant because the encoding of a string is not 
exposed at the XQuery layer.
We could be using anything and the behavior would be the same.
The only time encoding comes into play is when serializing to or deserializing 
from *bytes* (i.e., to/from a file or text document).
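
To illustrate the point outside of MarkLogic, here is a minimal Python sketch 
(not XQuery, and not anything ML does internally). The sample string and the 
two encodings are my own assumptions, picked only to show that a hash depends 
on the bytes chosen at serialization time, not on the string itself.

```python
import hashlib

# One abstract string. No encoding is involved until bytes are requested.
text = "caf\u00e9"  # hypothetical sample value containing a non-ASCII character

# Serializing the same string with two different encodings.
utf8_bytes = text.encode("utf-8")      # the accented letter takes two bytes
latin1_bytes = text.encode("latin-1")  # the accented letter takes one byte

utf8_md5 = hashlib.md5(utf8_bytes).hexdigest()
latin1_md5 = hashlib.md5(latin1_bytes).hexdigest()

print("byte lengths:", len(utf8_bytes), len(latin1_bytes))
print("hashes equal:", utf8_md5 == latin1_md5)

# Decoding either byte sequence gives back the very same string.
print("strings equal:", utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1"))
```

Both sides of a comparison therefore have to agree on the encoding before 
hashing, which is likely why specifying UTF-8 explicitly made the hashes line 
up.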


Unless you’re using text or binary documents, generating a hash on them is 
pointless.
You're not going to get the same hash as the original file. This is true for 
any processor of XML or JSON or structured documents that does any parsing or 
serializing.


Documents stored in the database are not byte-equal to the source document.

This is true at multiple levels. At the "store on the disk" level a 
"Document" doesn’t resemble the source document at all, any more than a CSV 
file resembles the block structure of an Oracle partition, let alone being 
byte-equal.

For some primitive document types, namely binary, what you get back should be 
equal to what you put in, exactly. But even text documents might undergo 
character set translation or Unicode normalization, so what you get back may 
not be byte-equal.
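
As a rough illustration of why normalization alone can break byte equality, 
the Python sketch below builds the same visible text in two Unicode forms. 
The sample string is an assumption, and the sketch makes no claim about which 
normalization, if any, a given store applies.

```python
import hashlib
import unicodedata

# The same visible text in two Unicode forms (hypothetical sample value).
composed = "caf\u00e9"                               # precomposed accented letter
decomposed = unicodedata.normalize("NFD", composed)  # plain "e" + combining accent

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# The byte sequences differ, so any hash over them differs too.
hashes_match = (
    hashlib.md5(composed_bytes).hexdigest()
    == hashlib.md5(decomposed_bytes).hexdigest()
)

# Normalizing both sides to the same form restores equality.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print("code points:", len(composed), len(decomposed))
print("hashes match:", hashes_match)
print("equal after NFC:", same_after_nfc)
```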

Structured docs are more complex. XML and JSON first undergo charset 
translation like text, then they are parsed into an internal node structure 
and then stored in a very concise format.

Even ignoring the disk format and MarkLogic ... ALL XML and JSON processors 
share this issue.

The text serialized form of a document is not the same thing as its value. 
You can rarely, if ever, do a round trip on JSON or XML and get back byte for 
byte what you started with. It’s *critical* to understand that the byte/text 
format of documents is the transport layer, not the document model itself. And 
with that concept, there are many equivalent ways to express the same document 
model. Very simple example: in XML, attributes have no ordering guarantee (nor 
do object members in JSON), and spaces between attributes like 
<foo  a="b"     c="d"/> are ignored, so <foo c="d" a="b"/> is the same 
document.
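
The attribute example can be checked with a short Python sketch using only 
the standard library. It hashes the two serializations as-is, compares the 
parsed elements, and then compares canonical forms produced by 
xml.etree.ElementTree.canonicalize (C14N 2.0, Python 3.8 or later). Using 
canonical XML here is my own choice of illustration, not something ML does 
for you.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two serializations of the same element: different spacing and attribute order.
xml_a = '<foo  a="b"     c="d"/>'
xml_b = '<foo c="d" a="b"/>'


def md5_hex(text: str) -> str:
    """MD5 of the UTF-8 bytes of a string, as a hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hashing the raw text sees two different byte sequences.
raw_hashes_match = md5_hex(xml_a) == md5_hex(xml_b)

# After parsing, the name and the attribute mapping are identical.
elem_a = ET.fromstring(xml_a)
elem_b = ET.fromstring(xml_b)
model_equal = elem_a.tag == elem_b.tag and elem_a.attrib == elem_b.attrib

# A canonical serialization sorts attributes and fixes the spacing,
# so hashing the canonical form gives a stable answer.
canon_a = ET.canonicalize(xml_a)
canon_b = ET.canonicalize(xml_b)
canonical_hashes_match = md5_hex(canon_a) == md5_hex(canon_b)

print("raw hashes match:", raw_hashes_match)
print("parsed model equal:", model_equal)
print("canonical hashes match:", canonical_hashes_match)
```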

To wrap this up, calculating a hash of a document before storing it and after 
retrieving it doesn’t give you the answer you want.
The hashes will almost certainly be different, whether or not the document "is 
the same thing".
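
The same effect shows up with JSON. The Python sketch below round-trips a 
small, made-up document through a parser and serializer, then hashes both 
forms. The canonical() helper is hypothetical: sorted keys and compact 
separators are just one possible definition of "the same document", and 
other definitions are equally defensible.

```python
import hashlib
import json

# Hypothetical source bytes, with arbitrary whitespace and key order.
original = b'{ "b": 1,\n  "a": [1, 2,3] }'


def md5_hex(data: bytes) -> str:
    """MD5 of a byte sequence, as a hex digest."""
    return hashlib.md5(data).hexdigest()


def canonical(value) -> bytes:
    """One possible canonical form: sorted keys, no optional whitespace, UTF-8."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# Parse, then serialize again with the serializer's default layout.
model = json.loads(original)
round_tripped = json.dumps(model).encode("utf-8")

print("bytes equal:", round_tripped == original)
print("hashes equal:", md5_hex(round_tripped) == md5_hex(original))

# The parsed values are equal, and so are hashes of the canonical form.
print("models equal:", json.loads(round_tripped) == model)
print(
    "canonical hashes equal:",
    md5_hex(canonical(model)) == md5_hex(canonical(json.loads(round_tripped))),
)
```

If both systems hash an agreed canonical form rather than whatever bytes each 
one happens to serialize, the comparison becomes meaningful.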

To provide better advice on how to compare for document equality, I need more 
specifics, such as what format the document is in and how you want to define 
equality.


From: [email protected] 
[mailto:[email protected]] On Behalf Of Sujith
Sent: Thursday, July 17, 2014 11:20 AM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] MD5 - Hash Question

What is the default encoding that ML uses for xdmp:quote()?

There is a daily job that loads Hadoop (Cloudera distribution) with the files 
that we have in ML using mlcp. Now we want to check whether both of them are 
in sync, so we are using an MD5 hash for validation. Initially we provided the 
Hadoop team with our hash and they came back saying that it didn't match their 
data. After doing some analysis we figured out that we should explicitly 
specify the encoding as UTF-8 in the xdmp:quote options, as the Java program 
on their end is doing the same.

(: The hash that did not match :)
xquery version "1.0-ml";
xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))


In other words, what would be the default encoding xdmp:quote uses? (My 
assumption is that by default ML saves data as UTF-8 if no encoding is 
specified, and that the same encoding is used when it retrieves the documents.)


(: The hash that matched :)

xquery version "1.0-ml";
xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),<options xmlns="xdmp:quote">
      <output-encoding>utf-8</output-encoding>
      <omit-xml-declaration>yes</omit-xml-declaration>
    </options>))

Any insight is very much appreciated.


--
Thanks & Regards
SujithMaram

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
