Re: [MarkLogic Dev General] Hash of pdf

Geert Josten Tue, 15 Nov 2016 02:26:20 -0800

Hi Pragya,

Do you want to detect whether the PDF as a whole is unique, or only detect 
textual changes? For the first, just take the hash of the binary; that lets 
you find exact duplicates. For the second, you could apply xdmp:pdf-convert 
or xdmp:document-filter. Both should return the textual contents as XHTML, 
which you can then reduce to a string and hash.


Note, though, that text is not extracted from images inside a PDF.
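
A minimal sketch of both options, meant for Query Console. The URI 
/docs/example.pdf is a hypothetical placeholder, and joining the text nodes 
and normalizing whitespace is only one assumed way of deciding what counts as 
"the same text"; adjust it to your needs.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a PDF already loaded as a binary document :)
let $uri := "/docs/example.pdf"
let $pdf := fn:doc($uri)

(: Option 1: hash the raw binary; identical files give identical hashes :)
let $binary-hash := xdmp:sha256($pdf/binary())

(: Option 2: extract the text as XHTML, flatten it to one string, hash that :)
let $xhtml := xdmp:document-filter($pdf)
let $text := fn:normalize-space(
  fn:string-join($xhtml//*:body//text(), " ")
)
let $text-hash := xdmp:sha256($text)

return ($binary-hash, $text-hash)
```

The first value changes whenever any byte of the file changes (including 
metadata), while the second should only change when the extracted text does.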

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya" 
<pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 10:15 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Hash of pdf


Thanks Geert,

But how could the text of the PDF be converted to a string using XQuery, so 
that it can be passed to xdmp:sha256 (http://docs.marklogic.com/xdmp:sha256)?

________________________________
From: <general-boun...@developer.marklogic.com> on behalf of Geert Josten 
<geert.jos...@marklogic.com>
Sent: Tuesday, November 15, 2016 2:10:33 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Hash of pdf

Hi Pragya,

I’d use xdmp:sha256 (http://docs.marklogic.com/xdmp:sha256), or, if you would 
like to combine it with a secret key, xdmp:hmac-sha256.
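
A short sketch of the keyed variant, in case it helps. The secret key and the 
document URI below are hypothetical placeholders, not values from this thread.

```xquery
xquery version "1.0-ml";

(: Hypothetical secret key; in practice, do not hard-code it in a module :)
let $secret := "replace-with-your-secret"
(: Hypothetical URI of a PDF stored as a binary document :)
let $pdf := fn:doc("/docs/example.pdf")/binary()

(: Keyed hash: depends on both the secret key and the binary content :)
return xdmp:hmac-sha256($secret, $pdf)
```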

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Kapoor, Pragya" 
<pkapo...@innodata.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, November 15, 2016 at 7:53 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: [MarkLogic Dev General] Hash of pdf


Hi,


We need to compute a hash of the text of a PDF file.

Is this possible in MarkLogic/XQuery?


Thanks

Pragya

"This e-mail and any attachments transmitted with it are for the sole use of 
the intended recipient(s) and may contain confidential, proprietary or 
privileged information. If you are not the intended recipient, please contact 
the sender by reply e-mail and destroy all copies of the original message. Any 
unauthorized review, use, disclosure, dissemination, forwarding, printing or 
copying of this e-mail or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful."

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
