You can also look into configuring a field to search on that excludes the 
element you don’t want to search, or configure word search to ignore the 
element. I’d recommend reading up on both before deciding which approach 
might be most appropriate.

-fs

From: Travis Raybold
Reply-To: Discussion
Date: Tuesday, September 22, 2015 at 7:53 AM
To: Discussion
Subject: Re: [MarkLogic Dev General] Avoiding HTML tags in searches

Thanks for the tips, folks. Sounds like if I don't want to modify the content 
of the HTML I return, I will need to store two copies - one to search on and 
one to return - correct? Might be time to increase the size of our SAN...

On Tue, Sep 22, 2015 at 3:23 AM Florent Georges wrote:
On 21 September 2015 at 21:42, David Ennis wrote:

The challenge you will have is that the HTML in the CDATA is likely indexed as 
text, so the feature listed needs to be on the element containing the CDATA.

You can use xdmp:tidy() for that.  It does a good job of recovery (in cases 
where the HTML is really bad).  The only time I had it fail to recover really 
bad HTML was when the input contained control characters (which we could 
remove by acting on the binary or string input before calling xdmp:tidy()).

https://docs.marklogic.com/xdmp:tidy
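
As a rough illustration of the control-character cleanup mentioned above, here 
is a minimal sketch. It assumes the cleanup runs client-side in Python before 
the content is loaded and handed to xdmp:tidy(); the function name 
strip_control_chars and the sample input are hypothetical, and the set of 
characters removed (C0 controls other than tab, LF and CR) is an assumption 
about what counts as a problem character, not something taken from the 
MarkLogic documentation.

```python
import re

# Assumption: the troublesome characters are the C0 controls other than
# tab (\x09), line feed (\x0a) and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(html: str) -> str:
    """Hypothetical helper: drop control characters, leave the rest untouched."""
    return _CONTROL_CHARS.sub("", html)


# Sample input with a NUL and a unit-separator character embedded in the markup.
raw = "<p>Hello\x00 <b>world\x1f</p>\n"
cleaned = strip_control_chars(raw)
```

The cleaned string still contains the unclosed <b> tag; repairing that kind of 
problem would remain the job of the tidy step.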

Depending on what you do exactly, you might want the tidied HTML to replace the 
original one, or rather to sit alongside it, so you can send the original input 
exactly as it was.

Regards,

--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/


_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general