Re: [MarkLogic Dev General] MD5 - Hash Question

Sujith Fri, 18 Jul 2014 08:56:48 -0700

Thanks Lee / Ennis,

Many thanks for the insight. We think we have the MD5 hash issue fixed.


Here are the requirements (comparing XML documents in MarkLogic against the
Hadoop system):

1. We have constant updates to the files that are in MarkLogic.
2. In our shop, Hadoop is the central repository / data store that
collects data from all the systems that support the organization, and
MarkLogic is one of the feeders.
3. We use mlcp to feed the updates to Hadoop. A lastUpdateDateTime
element captures the updates and drives the incremental feeds.
4. For now, all the data is XML.
5. MarkLogic (the feeder to Hadoop) is the source of truth, and at a given
point Hadoop wants to reconcile the data.
6. To achieve this we went with MD5 as an approach (the same way XQSync
does).
7. When we provided hashes of the documents, the Hadoop team told us
that they did not match the documents they have.

After some more analysis, we found that the hashes match when we set the
omit-xml-declaration option to "yes" in the xdmp:quote call on
fn:doc("/sample.xml").
When mlcp loads the data into Hadoop, the XML declaration is omitted, and
that difference between source and target was causing the hash mismatch.

xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),
    <options xmlns="xdmp:quote">
      <omit-xml-declaration>yes</omit-xml-declaration>
    </options>))
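
To illustrate why the declaration matters, here is a minimal, language-neutral
sketch in Python (it is not MarkLogic code). The document content is made up,
md5_hex is a hypothetical helper, and the sketch assumes both sides hash the
encoded bytes of the serialized text.

```python
import hashlib


def md5_hex(text: str, encoding: str = "utf-8") -> str:
    """Return the MD5 hex digest of the text after encoding it to bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


# Hypothetical document content, for illustration only.
body = "<doc><title>caf\u00e9</title></doc>"
with_declaration = '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# The same element content, serialized with and without the XML declaration.
hash_with = md5_hex(with_declaration)
hash_without = md5_hex(body)
declaration_changes_hash = hash_with != hash_without

# The same text, encoded two different ways before hashing.
encoding_changes_hash = md5_hex(body, "utf-8") != md5_hex(body, "latin-1")
```

Both flags end up true: the hash covers bytes, so any difference in the
serialization (declaration, encoding) changes it even when the element
content is the same.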


Many Thanks!!!


On Fri, Jul 18, 2014 at 8:13 AM, David Lee <[email protected]> wrote:

> Another suggestion (which I have used in the past) is to compute the MD5
> of the text document *before* sending it to ML and store it as a property.
> Then, when a new document arrives, check the MD5 of the text document (on
> disk). If it matches what I stored, I know they are the same and skip it.
> If they don't match, I know something has changed (maybe irrelevant
> whitespace), so I update the doc.
>
> This works well for 'sync'-like tools, where it is useful to assert:
>
> A) If the MD5 of the new *file* is different from the MD5 of the last
> *file*, then the document *MAY* be different.
>
> B) If the MD5 of the new *file* is identical to the MD5 of the last
> *file*, then the document *MUST* be the same.
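
A minimal sketch of the sync check described above, in Python. The dictionary
stands in for the MD5 property stored with each document, and the names
file_md5, needs_update and sync are hypothetical, not part of any MarkLogic
or mlcp API.

```python
import hashlib
from typing import Dict, Optional


def file_md5(data: bytes) -> str:
    """MD5 hex digest of the raw file bytes, taken before any parsing."""
    return hashlib.md5(data).hexdigest()


def needs_update(new_bytes: bytes, stored_md5: Optional[str]) -> bool:
    """True when there is no stored hash or the file bytes have changed."""
    return stored_md5 is None or file_md5(new_bytes) != stored_md5


# Stand-in for the per-document property store: URI -> MD5 of the last file.
stored_hashes: Dict[str, str] = {}


def sync(uri: str, new_bytes: bytes) -> bool:
    """Record the new hash and report whether the document was (re)loaded."""
    if needs_update(new_bytes, stored_hashes.get(uri)):
        stored_hashes[uri] = file_md5(new_bytes)
        return True   # file differs: the document MAY be different
    return False      # identical bytes: the document MUST be the same


first_load = sync("/sample.xml", b"<a/>")
unchanged = sync("/sample.xml", b"<a/>")
whitespace_only = sync("/sample.xml", b"<a />")
```

Note that the last call reloads the document even though only whitespace
changed, which is the "MAY be different" case.
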
>
>
>
> This, however, does not answer the question "did you store the same
> document I sent you?"
>
> I suggest that question is conceptually invalid or misguided in the first
> place. Go back to whoever is asking it and ask for clarification. What do
> they really mean? What is the real goal? You can satisfy the "checkbox
> requirement" by storing a "binary" version of the document alongside the
> XML one and checksumming that. Then you can happily say "Yes" ... but that
> is just giving people what they asked for, not what they need.
>
> The concept of document equality is not the same thing as file byte
> representation - period.
>
> You can do the deep-equal, yes. It's fairly expensive, but is that what
> they want?
> How can you prove you did it instead of just returning "true"?
>
> The question/requirement itself means that there is a mismatch in
> understanding of the true needs.
>
> Do they understand you're using a database, not a document repository?
> For example, if they sent you a CSV file to store in Oracle ... then wanted
> you to prove later that it was stored properly by sending back a
> checksum ... of what? Same problem.
>
>
>
> If you are expected to store the "file" as a blob and that's part of the
> requirements, then you need to store the file as a "blob" (in ML that would
> be a binary, or large binary).
>
> But if the assumption is that the "blob" is actually what you're querying
> ... that's simply wrong, and it is completely useless to ask for a checksum
> of the file. Even if you provided it, who's to say you're not storing a 1MB
> "blob" as a binary while the ML XML document you store is just
> "<bigfile/>"? You've satisfied the customer by giving them what they asked
> for but have not solved any actual real problem, and just added a bunch of
> work. You might as well just store the MD5 of the original file and give
> that back.
>
> Even if you *did* ONLY store the blob (say in a file, and didn't bother
> with ML at all), you can satisfy the customer, but only if the sole
> requirement was to store the file. Your app could still just produce
> random results.
>
>
>
> In general, I find this a classic case of micro-management or
> miscommunication of requirements.
>
> Very common ... a customer wants something, but the only way they know to
> express it is in terms of the things they know. So instead of expressing
> the requirement as "provide a method to validate that you received the
> document correctly" and "provide a method to validate that the application
> is returning the correct results for the latest document", they ask "give
> me a checksum of the file". You can give them what they ask for, or you can
> give them what they really need. But if you just blindly follow what they
> ask for, you can end up with vastly more work, a bad product, and a
> customer who is unhappy or misled.
>
>
>
>
>
> When these cases come up, usually the intent is one of these:
>
> 1) I want to make sure you *received* the file I sent correctly.
> This is very valid.
> Then store the MD5, length, and timestamp as a property object with the
> document.
>
> 2) If the core requirement is for you to store a "blob" ... that is,
> a document guaranteed to be unchanged byte-wise and retrievable in its
> original exact byte representation, then store the document as a binary.
> However, you won't be able to make much use of it besides returning it.
>
> 3) If the requirement is to be both a "blob store" AND a "document
> database", then you need to store both the binary and the parsed document.
> But when asked for validation, the customer must understand that you're
> not querying the blob. You can return it on demand to prove you have it,
> but you need a different definition of equality to prove your database
> document is the same as the blob.
> Also, what does that prove? You can prove it by end-to-end testing of
> your application, making sure that any queries produce the expected
> results. Anything less than that doesn't prove much.
>
> But you can store both, validate that you stored the blob correctly,
> retrieve it on demand, and use the parsed document for querying. You can
> explain it as similar to loading it into memory: you can't prove a Java
> object (say a tree structure) is exactly the same as the blob. That can
> only be done by higher-level application testing.
>
> 4) If the requirement is that you store a blob, but are also able to
> query it and also able to verify that *what you are using to query* is
> byte-identical to the blob, that question is invalid. You can't do both.
> The person asking for it needs to understand that, or you need to not
> promise to do it, because it's either impossible or pointless.
> Any attempt to provide the customer what they asked for would simply be a
> lie.
>
> 5) Mixed in with all these, many people misunderstand documents
> entirely and actually believe that the file format "Is The Document" ...
> in all meanings of the word. This is debatable, but staying away from pure
> philosophy and abstract math, in practice it is generally a
> misunderstanding. If you open even a .txt file and save it without
> changes, and the result has a different checksum, is the document the
> same?
> What if the only difference is changing CR/LF to LF? Are they the same?
> What if it's changed from ASCII to UTF-8 ... are they the same?
> The answer lies in the context ... "For what purpose do you define
> equality?"
>
> Any answer that doesn't have the context explicit is going to be wrong or
> not useful.
>
>
>
> To end that long story :)
>
> I strongly encourage you to find out precisely what the real intent and
> requirements are. If you don't, then any solution to the problem as stated
> is very likely to be misguided and not solve the real problems.
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *David Ennis
> *Sent:* Friday, July 18, 2014 5:23 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] MD5 - Hash Question
>
>
>
> Hi.
>
>
>
> For what has been described, perhaps fn:deep-equal can help, as it
> more or less takes into account the 'pitfalls' listed by David Lee:
>
> - for attributes, order does not matter
>
> - whitespace is ignored in element definitions as well as between elements
> (essentially ignored if not an atomic value, I would assume)
>
> - (presumably because the comparison is done on the internal
> representation) things like single quotes versus double quotes make no
> difference (not the case if you were doing a hash check)
>
> Therefore, documents are equal if all of the nodes, child nodes, etc. are
> present with the exact same attributes and the same values in all places,
> while ignoring the order of attributes as well as unneeded whitespace.
>
>
>
> In a Nutshell:
>
> <foo > <bar a="b" c='d'>baz</bar></foo>
> equals
>
> <foo> <bar  c='d' a='b'>baz</bar>    </foo>
>
> (and of course this follows for deep structures as well)
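
For a side that cannot call fn:deep-equal (the Hadoop end, say), a roughly
comparable check is to hash a canonical form of the XML instead of the raw
text. The sketch below uses Python's xml.etree.ElementTree.canonicalize
(Python 3.8+), not MarkLogic. canonical_md5 is a hypothetical helper, and
strip_text=True is an assumption about the desired equality: it discards
whitespace-only text and trims text values, which may or may not match how
whitespace text nodes are kept in a given database.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available from Python 3.8


def canonical_md5(xml_text: str) -> str:
    """MD5 of the canonical (C14N 2.0) form of an XML document.

    Canonicalization sorts attributes and normalizes quoting; strip_text
    additionally drops whitespace around text content (an assumption).
    """
    canonical = canonicalize(xml_text, strip_text=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# The two documents from the example above, plus one with a changed value.
doc_a = """<foo > <bar a="b" c='d'>baz</bar></foo>"""
doc_b = """<foo> <bar  c='d' a='b'>baz</bar>    </foo>"""
doc_c = """<foo><bar a="b" c="d">qux</bar></foo>"""

raw_hashes_match = (
    hashlib.md5(doc_a.encode("utf-8")).hexdigest()
    == hashlib.md5(doc_b.encode("utf-8")).hexdigest()
)
canonical_hashes_match = canonical_md5(doc_a) == canonical_md5(doc_b)
changed_value_detected = canonical_md5(doc_a) != canonical_md5(doc_c)
```

Here the raw hashes differ while the canonical hashes agree, and a real
change in a value still produces a different canonical hash.
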
>
>
> So, perhaps a solution would be a check in MarkLogic:
> A = internal doc
> B = fetch doc from Hadoop via http/odbc/whatever and do a deep-diff on it.
> (or expose deep-diff via some ML API and then just have some system that
> fetches from Hadoop and asks ML if the doc matches the one it has)
>
>
>
> Kind Regards,
>
> David Ennis
>
>
>
> On 17 July 2014 19:56, David Lee <[email protected]> wrote:
>
> I will reply with a shorter version of what came up recently on another
> list.
>
> ML uses UTF-8 internally for xs:string, which is what xdmp:quote() returns,
> but that's misleading. It's irrelevant because the encoding of a string is
> not exposed at the XQuery layer. We could be using anything and the
> behavior would be the same.
>
> The only time encoding comes into play is when serializing to or
> deserializing from *bytes* (i.e. to/from a file or text document).
>
>
>
> Unless you're using text or binary documents, generating a hash on them
> is pointless.
>
> You're not going to get the same hash as the original file. This is true
> for any processor of XML or JSON or structured documents that does any
> parsing or serializing.
>
>
>
> Documents stored in the database are not byte-equal to the source document.
>
> This is true at multiple levels. At the "store on the disk" level a
> "document" doesn't resemble the source document at all, any more than a CSV
> file resembles the block structure of an Oracle partition - let alone
> being byte-equal.
>
> For some primitive document types - namely binary - what you get back
> should be exactly equal to what you put in. But even text documents might
> undergo character set translation or Unicode normalization, so what you get
> back may not be byte-equal ...
>
> Structured docs are more complex. XML and JSON first undergo charset
> translation like text, then they are parsed into an internal node structure
> and stored in a very concise format.
>
> Even ignoring the disk format and MarkLogic ... ALL XML and JSON
> processors share this issue.
>
> The serialized text form of a document is not the same thing as its
> value. You can rarely, if ever, do a round trip on JSON or XML and get back
> byte for byte what you started with. It's *critical* to understand that the
> byte/text format of documents is the transport layer, not the document
> model itself. And with that concept, there are many equivalent ways to
> express the same document model. Very simple example: in XML, attributes
> have no ordering guarantee (nor in JSON), and spaces between attributes
> like <foo  a="b"     c="d"/> are ignored, so <foo c="d" a="b"/> is the
> same document.
>
>
>
> To wrap this up, calculating a hash of a document before storing it and
> after retrieving it doesn't give you the answer you want.
>
> The hashes will almost certainly be different - whether or not the
> document "is the same thing" ...
>
> To provide better advice on how to compare for document equality, I need
> more specifics, such as what format the document is in and how you want to
> define equality.
>
>
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Sujith
> *Sent:* Thursday, July 17, 2014 11:20 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* [MarkLogic Dev General] MD5 - Hash Question
>
>
>
> What is the default encoding that ML uses for xdmp:quote()?
>
>
>
> There is a daily job that loads Hadoop (Cloudera distribution) with the
> files that we have in ML using mlcp. Now we want to check whether both are
> in sync, so we are using an MD5 hash for validation. Initially we provided
> Hadoop with our hash and they came back saying that it didn't match
> their data. After some analysis we figured out that we should explicitly
> specify the UTF-8 encoding option in xdmp:quote, as the Java program on
> their end does the same.
>
>
>
> (: The hash that did not match :)
>
> xquery version "1.0-ml";
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))
>
>
>
>
>
> In other words, what is the default encoding that xdmp:quote uses? (My
> assumption is that by default ML saves data as UTF-8 if no encoding is
> specified, and that the same encoding is used when it retrieves the
> documents.)
>
>
>
>
>
> (: The hash that matches :)
>
>
>
> xquery version "1.0-ml";
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),<options xmlns="xdmp:quote">
>
>       <output-encoding>utf-8</output-encoding>
>
>       <omit-xml-declaration>yes</omit-xml-declaration>
>
>     </options>))
>
>
>
> Any insight is very much appreciated.
>
>
>
>
>
> --
>
> Thanks & Regards
> SujithMaram
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>


-- 
Thanks & Regards
SujithMaram
