You can also look into configuring a field to search on that excludes the element you don’t want to search, or configuring word search to ignore that element. I’d recommend reading up on both before deciding which might be most appropriate.
-fs

From: Travis Raybold
Reply-To: Discussion
Date: Tuesday, September 22, 2015 at 7:53 AM
To: Discussion
Subject: Re: [MarkLogic Dev General] Avoiding HTML tags in searches

Thanks for the tips, folks. Sounds like if I don't want to modify the content of the HTML I return, I will need to store two copies - one to search on and one to return - correct? Might be time to increase the size of our SAN...

On Tue, Sep 22, 2015 at 3:23 AM Florent Georges <[email protected]> wrote:

On 21 September 2015 at 21:42, David Ennis wrote:

The challenge you will have is that the HTML in the CDATA is likely indexed as text, so the feature listed needs to be on the element containing the CDATA..

You can use xdmp:tidy() for that. It does a good job for recovery (in cases the HTML is really bad). The only time I had it fail to recover really bad HTML, was when the input contained control characters (which we could remove by acting on the binary or string input, before calling xdmp:tidy().)

https://docs.marklogic.com/xdmp:tidy

Depending on what you do exactly, you might want the tidied HTML to replace the original one, or rather to sit aside it, so you can send the original input exactly as it was.

Regards,

--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
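For anyone following along, here is a rough sketch of the xdmp:tidy() approach discussed in the thread, with the tidied markup stored next to the untouched original so the original can still be returned exactly as it came in. It is only an illustration under assumptions: the sample input string and the element names `doc`, `original` and `searchable` are made up for this example and do not come from anyone's actual schema.

```xquery
xquery version "1.0-ml";

(: Hypothetical input: messy HTML text as it might arrive inside a CDATA section. :)
let $raw := "<p>Some <b>bold text<p>and an unclosed paragraph"

(: xdmp:tidy returns two nodes: a status report first, then the cleaned-up document. :)
let $result := xdmp:tidy($raw)
let $tidied := $result[2]

(: Assumed envelope layout: keep the original string for returning to the caller,
   and the tidied markup for searching. Element names are illustrative only. :)
return
  <doc>
    <original>{$raw}</original>
    <searchable>{$tidied}</searchable>
  </doc>
```

With a layout along these lines, a field or word-search setting could presumably be scoped to one element and ignore the other, at the cost of the extra storage mentioned above.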
