Hi Pragya,

Do you want to detect whether the PDF as a whole is unique, or only detect changes to its text? For the former, you'd just take the hash of the binary; that is enough to find exact duplicates. For the latter, you could first apply xdmp:pdf-convert or xdmp:document-filter. Both should return the textual content as HTML.
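The two options above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the URI /docs/example.pdf is a hypothetical placeholder, and hashing only the body text of the filter output (with whitespace normalized) is an assumption about what counts as a "textual change".

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; replace with the URI of a PDF stored in your database. :)
let $uri := "/docs/example.pdf"
let $pdf := fn:doc($uri)

(: Option 1: hash the raw binary. Byte-identical files give identical hashes. :)
let $binary-hash := xdmp:sha256($pdf/binary())

(: Option 2: extract the text as XHTML, then hash only the body text.
   The head of the filter output carries metadata, which is skipped here
   on the assumption that only textual changes matter. :)
let $xhtml := xdmp:document-filter($pdf)
let $text := fn:normalize-space(fn:string-join($xhtml//*:body//text(), " "))
let $text-hash := xdmp:sha256($text)

return ($binary-hash, $text-hash)
```

Normalizing whitespace before hashing is a design choice: it keeps the text hash stable when two PDFs differ only in layout, at the cost of ignoring whitespace-only edits.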
Note, though, that text is not extracted from images inside a PDF.

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya" <pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 10:15 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Hash of pdf

Thanks Geert,

But how can the text of the PDF be converted to a string in XQuery, so that it can be passed to xdmp:sha256 <http://docs.marklogic.com/xdmp:sha256>?

________________________________
From: general-boun...@developer.marklogic.com <general-boun...@developer.marklogic.com> on behalf of Geert Josten <geert.jos...@marklogic.com>
Sent: Tuesday, November 15, 2016 2:10:33 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

I'd use http://docs.marklogic.com/xdmp:sha256, or, if you'd like to combine it with a secret key, xdmp:hmac-sha256.

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya" <pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 7:53 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: [MarkLogic Dev General] Hash of pdf

Hi,

We need to find the hash of the text of a PDF file. Is this possible in MarkLogic/XQuery?

Thanks,
Pragya

"This e-mail and any attachments transmitted with it are for the sole use of the intended recipient(s) and may contain confidential , proprietary or privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this e-mail or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful."
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general