Thanks all for such a quick response.

________________________________
From: general-boun...@developer.marklogic.com 
<general-boun...@developer.marklogic.com> on behalf of David Lee 
<david....@marklogic.com>
Sent: Wednesday, November 30, 2016 7:30:39 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

PDFs are binary.
xs:string() converts the document to a string representation, which is nowhere
near byte-for-byte identical to the binary.
You can verify this easily by returning just

   xs:string(fn:doc("/binary.pdf")) -->  
"255044462D312E340A25D3EBE9E10A312030206F626A0A3C3C2F43726561746F7220284D6F7A696C6C612F352E30205C2857"



(text, not binary)

Instead, use the binary node type, not a string:

https://docs.marklogic.com/guide/app-dev/binaries

-----

Example: I loaded a PDF as /mypdf.pdf
sh-4.3$ openssl dgst -sha256 mypdf.pdf
SHA256(mypdf.pdf)= 
722393c877de277c88f38b51253330d6b3b6189a535888e13e69057ced1ed135
sh-4.3$ ls -l mypdf.pdf
-rwx------+ 1 DLEE None 106309 Nov 25 18:22 mypdf.pdf


In ML QConsole:


fn:string-length(xs:string(fn:doc("/mypdf.pdf"))),
xdmp:binary-size(fn:doc("/mypdf.pdf")/binary()),
xdmp:sha256(xs:string(fn:doc("/mypdf.pdf"))),
xdmp:sha256(fn:doc("/mypdf.pdf")/binary()),
fn:substring(xs:string(fn:doc("/mypdf.pdf")),1,100)
----

212618
106309
5edfda73dc5dbf21dc8c605aa8b2b2b804626207b100febfffd38b9fb5d2058b
722393c877de277c88f38b51253330d6b3b6189a535888e13e69057ced1ed135

255044462D312E340A25D3EBE9E10A312030206F626A0A3C3C2F43726561746F7220284D6F7A696C6C612F352E30205C2857
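
The figures above fit together: 212618 is exactly 2 x 106309, which is what you
would expect if xs:string() renders each byte as two hex digits, and the leading
digits 255044462D312E34 decode to "%PDF-1.4". The Java sketch below reproduces
that effect outside MarkLogic. It is only an illustration under that assumption:
the class and method names (HexVsBinary, toUpperHex, sha256Hex) are made up for
this example, and the sample bytes are just a PDF header line, not the real file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of MarkLogic or of the original thread.
final class HexVsBinary {

    // Render each byte as two uppercase hex digits, e.g. 0x25 -> "25".
    static String toUpperHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // SHA-256 of the given bytes, returned as lowercase hex.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: the first line of a PDF 1.4 file.
        byte[] raw = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        String hexText = toUpperHex(raw);
        byte[] hexTextBytes = hexText.getBytes(StandardCharsets.US_ASCII);

        System.out.println(hexText);
        System.out.println(raw.length + " bytes vs " + hexText.length() + " characters");
        // Hashing the hex text is hashing different input, so the digests differ.
        System.out.println("hash of bytes:    " + sha256Hex(raw));
        System.out.println("hash of hex text: " + sha256Hex(hexTextBytes));
    }
}
```

The two digests printed at the end differ for the same reason the third and
fourth results above differ: one is computed over the raw bytes, the other over
a text rendering that is twice as long.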






From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Kapoor, Pragya
Sent: Wednesday, November 30, 2016 12:37 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Hash of pdf


Thanks Geert.



However, the requirement is as follows:



The hash key of a document is based on the bytes in the file.

We have both text and image PDFs. We want to compute the hash of each PDF
(to establish its uniqueness) using SHA-1.



I am using the XQuery code below:
let $x := fn:doc("/pdfs/0000104.pdf")
let $doc :=xs:string($x)
return xdmp:sha1($doc,"hex")

Output is 0a183862beed888850cd1d327768aac6a2980535


whereas the Java code below produces a different output:



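import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Formatter;

// Compute the SHA-1 of a file's raw bytes and return it as lowercase hex.
public static String sha1(final File file) throws NoSuchAlgorithmException,
        IOException {
    final MessageDigest messageDigest = MessageDigest.getInstance("SHA1");

    // Read the file in 1 KB chunks and feed each chunk to the digest.
    try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
        final byte[] buffer = new byte[1024];
        for (int read = 0; (read = is.read(buffer)) != -1;) {
            messageDigest.update(buffer, 0, read);
        }
    }

    // Convert the digest bytes to hex format
    try (Formatter formatter = new Formatter()) {
        for (final byte b : messageDigest.digest()) {
            formatter.format("%02x", b);
        }
        return formatter.toString();
    }
}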



I need to generate in MarkLogic the same hash that the Java code generates.



Please suggest.



Thanks

Pragya

________________________________
From: general-boun...@developer.marklogic.com
<general-boun...@developer.marklogic.com> on behalf of Geert Josten
<geert.jos...@marklogic.com>
Sent: Tuesday, November 15, 2016 3:55:12 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

Would you prefer to detect the uniqueness of the pdf entirely, or detect 
textual changes only? For the first you’d just take the hash of the binary, 
that allows finding duplicates. For the latter you could apply 
xdmp:pdf-convert, or xdmp:document-filter. Both should return textual contents 
as HTML.

Note, though, that text is not extracted from PDF images.

Cheers,
Geert
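
To illustrate the first option (hashing the binary to find duplicates), here is
a minimal Java sketch that groups documents by the SHA-256 of their bytes. The
names DuplicateFinder, sha256Hex and groupByHash are hypothetical, and the
in-memory map of name to bytes stands in for however the documents are really
read; it is not a MarkLogic API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only.
final class DuplicateFinder {

    // SHA-256 of the given bytes, returned as lowercase hex.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Map each digest to the names of all documents whose bytes produce it.
    // A group with more than one name holds byte-identical documents
    // (ignoring the negligible chance of a hash collision).
    static Map<String, List<String>> groupByHash(Map<String, byte[]> docs) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : docs.entrySet()) {
            groups.computeIfAbsent(sha256Hex(e.getValue()), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }
}
```

Two PDFs that render the same text but differ by even one byte (for example in
embedded metadata) land in different groups, which is why this only answers the
"entire file" question and not the "textual changes" one.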

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya"
<pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 10:15 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Hash of pdf


Thanks Geert.

But how can the text of the PDF be converted to a string with XQuery, so that it
can be passed to xdmp:sha256 (http://docs.marklogic.com/xdmp:sha256)?

________________________________
From: general-boun...@developer.marklogic.com
<general-boun...@developer.marklogic.com> on behalf of Geert Josten
<geert.jos...@marklogic.com>
Sent: Tuesday, November 15, 2016 2:10:33 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

I’d use http://docs.marklogic.com/xdmp:sha256 or, if you would like to combine
it with a secret key, xdmp:hmac-sha256.

Cheers,
Geert
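
For readers comparing against an external implementation, this is roughly what
a keyed hash (HMAC-SHA256) looks like in Java, using the standard
javax.crypto.Mac class. The class and method names (HmacSketch, hmacSha256Hex)
are hypothetical. Whether xdmp:hmac-sha256 returns the same hex for the same key
and message depends on how it encodes its arguments, which this sketch does not
verify.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper class for illustration only.
final class HmacSketch {

    // HMAC-SHA256 of the message under the given secret key, as lowercase hex.
    static String hmacSha256Hex(byte[] key, byte[] message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            StringBuilder sb = new StringBuilder();
            for (byte b : mac.doFinal(message)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Key and message taken from RFC 4231, test case 2.
        byte[] key = "Jefe".getBytes(StandardCharsets.US_ASCII);
        byte[] msg = "what do ya want for nothing?".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hmacSha256Hex(key, msg));
    }
}
```

Unlike a plain SHA-256, the result changes when the key changes, so only
someone holding the key can recompute or check it.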

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya"
<pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 7:53 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: [MarkLogic Dev General] Hash of pdf


Hi,



We need to compute a hash of the text of a PDF file.

Is this possible in MarkLogic/XQuery?



Thanks

Pragya
"This e-mail and any attachments transmitted with it are for the sole use of 
the intended recipient(s) and may contain confidential , proprietary or 
privileged information. If you are not the intended recipient, please contact 
the sender by reply e-mail and destroy all copies of the original message. Any 
unauthorized review, use, disclosure, dissemination, forwarding, printing or 
copying of this e-mail or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful."
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general