Re: [MarkLogic Dev General] MD5 - Hash Question

David Ennis Fri, 18 Jul 2014 02:23:26 -0700

Hi,

For what has been described, perhaps fn:deep-equal can help as it
more-or-less takes into account the 'pitfalls' listed by David Lee:


- for attributes, order does not matter
- whitespace inside tags is ignored, as is whitespace between elements in
the example below (only because XQuery element constructors strip boundary
whitespace by default; whitespace-only text nodes kept in a parsed document
are still compared)
- because the comparison is done on the internal representation, things
like single versus double quotes make no difference (not the case if you
were doing a hash check)

Therefore, documents are equal if all of the nodes, child nodes, etc. are
present with the same attributes and the same values in all places,
regardless of attribute order and insignificant whitespace.

In a Nutshell:
<foo > <bar a="b" c='d'>baz</bar></foo>
equals
<foo> <bar  c='d' a='b'>baz</bar>    </foo>
(and of course this follows for deep structures as well)
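
As a rough sketch of the comparison above (not something tested against a
particular MarkLogic release), the two elements can be compared both ways.
The variable names are made up for illustration, and the boundary-space
declaration only states the usual default explicitly:

```xquery
xquery version "1.0-ml";

(: Strip whitespace-only text between tags in the constructors below.
   This is the usual default; it is declared here so the sketch does not
   depend on configuration. :)
declare boundary-space strip;

(: $a and $b are illustrative names for the two elements shown above. :)
let $a := <foo > <bar a="b" c='d'>baz</bar></foo>
let $b := <foo> <bar  c='d' a='b'>baz</bar>    </foo>
return (
  (: Node-level comparison: attribute order and quote style play no part. :)
  fn:deep-equal($a, $b),

  (: Hash of the serialized text. Whether this is true depends on how each
     value happens to be serialized, so it should not be relied on. :)
  xdmp:md5(xdmp:quote($a)) eq xdmp:md5(xdmp:quote($b)),

  (: Parsed input: if the parser keeps the whitespace-only text node inside
     the first <foo>, fn:deep-equal treats the two documents as different. :)
  fn:deep-equal(xdmp:unquote("<foo> <bar/></foo>"),
                xdmp:unquote("<foo><bar/></foo>"))
)
```

The first item should be true. The other two are there to show where a hash
or stray whitespace can still get in the way.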

So, perhaps a solution would be a check in MarkLogic:
A = internal doc
B = doc fetched from Hadoop via http/odbc/whatever; then run fn:deep-equal
on A and B.
(or, expose the fn:deep-equal check via some ML API and have an external
system fetch from Hadoop and ask ML whether the doc matches the one it has)
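
A minimal sketch of that check, assuming the Hadoop copy can be fetched over
HTTP. The URL is hypothetical (the thread does not say how the Hadoop side is
exposed), and the handling of the response body is an assumption about what
the remote server returns:

```xquery
xquery version "1.0-ml";

(: Hypothetical endpoint serving the Hadoop copy of the document. :)
declare variable $HADOOP-URL as xs:string :=
  "http://hadoop.example.com/docs/sample.xml";

(: A = the document as stored in MarkLogic. :)
let $a := fn:doc("/sample.xml")

(: xdmp:http-get returns the response metadata first and the body second. :)
let $response := xdmp:http-get($HADOOP-URL)
let $body := $response[2]

(: B = the remote copy as parsed XML. If the body did not arrive as XML
   (for example, it was served as plain text), parse its string value. :)
let $b :=
  if ($body/element()) then $body
  else xdmp:unquote(fn:string($body))

return fn:deep-equal($a, $b)
```

If the Hadoop copy was pretty-printed or otherwise re-indented, the extra
whitespace-only text nodes may make this return false even though the content
matches, so some normalization on one side may be needed first.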

Kind Regards,
David Ennis


On 17 July 2014 19:56, David Lee <[email protected]> wrote:

>  I will reply with a shorter version of what came up recently on another
> list.
>
> ML uses UTF-8 internally for xs:string, which is what xdmp:quote() returns.
>
> But that’s misleading. It's irrelevant, because the encoding of a string is
> not exposed at the XQuery layer.
>
> We could be using anything and the behavior would be the same.
>
> The only time encoding comes into play is when serializing to or
> deserializing from *bytes* (i.e. to/from a file or text document).
>
>
>
> Unless you're using text or binary documents, generating a hash on them
> is pointless.
>
> You're not going to get the same hash as the original file. This is true
> for any processor of XML or JSON or structured documents that does any
> parsing or serializing.
>
>
>
> Documents stored in the database are not byte-equal to the source document.
>
> This is true at multiple levels. At the "store on the disk" level a
> "document" doesn’t resemble the source document at all,
>
> any more than a CSV file resembles the block structure of an Oracle
> partition, let alone being byte-equal.
>
> For some primitive document types, namely binary, what you get back
> should be exactly equal to what you put in. But even text documents might
> undergo character set translation or Unicode normalization, so what you get
> back may not be byte-equal.
>
> Structured docs are more complex. XML and JSON first undergo charset
> translation like text, then they are parsed into an internal node structure
> and stored in a very concise format.
>
> Even ignoring the disk format and MarkLogic, ALL XML and JSON
> processors share this issue.
>
> The serialized text form of a document is not the same thing as its
> value. You can rarely, if ever, do a round trip
>
> on JSON or XML and get back byte for byte what you started with. It's
> *critical* to understand that the byte/text format
>
> of documents is the transport layer, not the document model itself. And
> with that concept, there are many equivalent ways to express the same
> document model. A very simple example: in XML, attributes have no ordering
> guarantee (nor do object members in JSON),
>
> and spaces between attributes, as in <foo  a="b"     c="d"/>, are ignored, so
> <foo c="d" a="b"/> is the same document.
>
>
>
> To wrap this up, calculating a hash of a document before storing it and
> after retrieving it doesn’t give you the answer you want.
>
> The hashes will almost certainly be different, whether or not the
> document "is the same thing".
>
>
>
> To provide better advice on how to compare for document equality I need
> more specifics, such as what format the document is in and how you want to
> define equality.
>
>
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Sujith
> *Sent:* Thursday, July 17, 2014 11:20 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* [MarkLogic Dev General] MD5 - Hash Question
>
>
>
> What is the default encoding that ML uses for xdmp:quote()?
>
>
>
> There is a daily job that loads Hadoop (Cloudera distribution) with the files
> that we have in ML, using mlcp. Now we want to check whether both of them are
> in sync, so we are using an MD5 hash for validation. Initially we provided
> Hadoop with our hash and they came back saying that it didn't match
> their data. After some analysis we figured out that we should
> explicitly specify the encoding as UTF-8 via an option in xdmp:quote, as the
> Java program on their end is doing the same.
>
>
>
> (: The hash that did not match :)
>
> xquery version "1.0-ml";
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))
>
>
>
>
>
> In other words, what is the default encoding xdmp:quote uses? (My
> assumption is that by default ML saves data as UTF-8 if no
> encoding is specified, and that the same encoding is used when it
> retrieves the documents.)
>
>
>
>
>
> (: The hash that matched :)
>
>
>
> xquery version "1.0-ml";
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),<options xmlns="xdmp:quote">
>
>       <output-encoding>utf-8</output-encoding>
>
>       <omit-xml-declaration>yes</omit-xml-declaration>
>
>     </options>))
>
>
>
> Any insight is very much appreciated.
>
>
>
>
>
> --
>
> Thanks & Regards
> SujithMaram
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>
>
