Hi Pragya,

You are applying xs:string to the PDF document. Try skipping that and apply xdmp:sha1 to the binary document directly.
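To illustrate why the string conversion matters, here is a rough Java sketch (not MarkLogic code). A SHA-1 digest is computed over bytes, so a string form of a binary file is presumably a different input than the file itself. The class name, helper name, and sample bytes are made up for the illustration; the sample stands in for PDF content that is not valid UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashInputDemo {
    // Hex-encoded SHA-1 digest of the given bytes.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for binary PDF content: ASCII header
        // followed by bytes that do not form valid UTF-8.
        byte[] raw = {'%', 'P', 'D', 'F',
                (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

        // Round trip through a String: invalid sequences are replaced
        // during decoding, so the bytes that come back out are different.
        byte[] viaString = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("raw bytes : " + sha1Hex(raw));
        System.out.println("via string: " + sha1Hex(viaString));
    }
}
```

The two printed digests differ, which is the same kind of mismatch you see between a hash of the string value and the hash the Java code computes from the file's bytes.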
Cheers,
Geert

From: <[email protected]> on behalf of "Kapoor, Pragya" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Wednesday, November 30, 2016 at 6:37 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Hash of pdf

Thanks Geert.

However, the requirement is as follows: the hash key of a document is based on the bytes in the file. We have both text and image PDFs, and we want to determine the uniqueness of each PDF using a SHA1 hash.

I am using the XQuery code below:

    let $x := fn:doc("/pdfs/0000104.pdf")
    let $doc := xs:string($x)
    return xdmp:sha1($doc, "hex")

The output is 0a183862beed888850cd1d327768aac6a2980535

whereas the Java code below produces a different output:

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Formatter;

    public static String sha1(final File file) throws NoSuchAlgorithmException, IOException {
        final MessageDigest messageDigest = MessageDigest.getInstance("SHA1");
        // Feed the file to the digest in 1 KB chunks
        try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
            final byte[] buffer = new byte[1024];
            for (int read = 0; (read = is.read(buffer)) != -1;) {
                messageDigest.update(buffer, 0, read);
            }
        }
        // Convert the digest bytes to hex format
        try (Formatter formatter = new Formatter()) {
            for (final byte b : messageDigest.digest()) {
                formatter.format("%02x", b);
            }
            return formatter.toString();
        }
    }

I need to generate in MarkLogic the same hash that the Java code generates. Please suggest.

Thanks
Pragya

________________________________
From: [email protected] <[email protected]> on behalf of Geert Josten <[email protected]>
Sent: Tuesday, November 15, 2016 3:55:12 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

Would you prefer to detect the uniqueness of the PDF as a whole, or detect textual changes only? For the first, you’d just take the hash of the binary, which allows finding duplicates. For the latter, you could apply xdmp:pdf-convert or xdmp:document-filter. Both should return the textual contents as HTML. Note, though, that text is not extracted from images in PDFs.

Cheers,
Geert

From: <[email protected]> on behalf of "Kapoor, Pragya" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, November 15, 2016 at 10:15 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Hash of pdf

Thanks Geert.

But how could the text of the PDF be converted to a string, so that it could be used in xdmp:sha256 <http://docs.marklogic.com/xdmp:sha256> using XQuery?

________________________________
From: [email protected] <[email protected]> on behalf of Geert Josten <[email protected]>
Sent: Tuesday, November 15, 2016 2:10:33 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

I’d use http://docs.marklogic.com/xdmp:sha256, or if you’d like to combine it with a secret key, xdmp:hmac-sha256.

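For readers more at home in Java, the difference between a plain hash and a keyed hash can be sketched as follows. This is a Java analog using the standard JDK classes, not the MarkLogic functions themselves; the class name, method names, key, and sample data are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class KeyedHashSketch {
    // Lowercase hex encoding of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Plain SHA-256: anyone with the same data gets the same digest.
    static String sha256Hex(byte[] data) {
        try {
            return hex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // HMAC-SHA256: the digest also depends on a secret key.
    static String hmacSha256Hex(byte[] key, byte[] data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return hex(mac.doFinal(data));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        byte[] data = "extracted text of a document".getBytes(StandardCharsets.UTF_8);
        byte[] key = "example-secret".getBytes(StandardCharsets.UTF_8);

        System.out.println("sha256     : " + sha256Hex(data));
        System.out.println("hmac-sha256: " + hmacSha256Hex(key, data));
    }
}
```

For plain duplicate detection the unkeyed hash is enough; the keyed variant only matters if others should not be able to recompute the value without the key.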
Cheers,
Geert

From: <[email protected]> on behalf of "Kapoor, Pragya" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, November 15, 2016 at 7:53 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: [MarkLogic Dev General] Hash of pdf

Hi,

We need to find the hash of the text of a PDF file. Is this possible in MarkLogic/XQuery?

Thanks
Pragya

"This e-mail and any attachments transmitted with it are for the sole use of the intended recipient(s) and may contain confidential, proprietary or privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this e-mail or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful."

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
