I have a case where I need to search our MarkLogic database for millions of GUIDs. Specifically, I have sitemaps loaded into a directory in MarkLogic broken into chunks of 100,000 values each that look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
            xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>000001c10ede4faaaf3b53ad0c0d4172.0</loc></url>
      <url><loc>036501c10ede4faaaf3b73ad0c0d4168.0</loc></url>
      <url><loc>098701c10ede4faaaf3b83ad0c0d4102.0</loc></url>
    </urlset>

I have a total of about 44 million values to search for. Each value in <loc> may occur one or more times as the value of an element (nml:itemIdSequence in the query below) in one or more of about 28.5 million XML documents in MarkLogic.

What I'd like to do is return a list of the values that exist in the sitemaps but are not found in the database.

I have tried several approaches, including (at a high level, without going into detail):

* Using straight cts:search()
* Writing the values from the documents in the database to document properties and doing a cts:properties-query()
* Using xdmp:exists()

Here's a sample of one of the queries I tried (the most successful one, when paired with the Content Processing Framework):

    xquery version "1.0-ml";

    declare namespace nml = "http://iptc.org/std/nar/2006-10-01/";
    declare namespace sm = "http://www.sitemaps.org/schemas/sitemap/0.9";

    for $g in xdmp:directory("/sitemaps/", "infinity")//sm:loc
    let $uris := cts:element-value-query(xs:QName("nml:itemIdSequence"), xs:string($g))
    return
      if (xdmp:exists(cts:search(fn:doc()//nml:newsMessage, $uris, "unfiltered")))
      then ()
      else xdmp:node-insert-child(
        doc("/comparisons/missing-items.xml")/missing,
        <loc>{xs:string($g)}</loc>)

I've tried running this query by:

* Using CORB
* Creating a CPF scenario where I batched the sitemaps into smaller sitemaps of 250 <loc> items apiece and then kicked off the query above
* Plugging this into QConsole and running it

My experience has been:

* CORB times out
* CPF works on the smaller batches, but is very, very slow
* QConsole, of course, times out

Can anyone suggest other ways I might go about this? I'm looking for a method that would take at most hours, not weeks (which is how long my CPF setup would take if I were to use it), to search for the 44 million values.

The element value in the documents I'm searching against in the database is NOT indexed at this time, but I would be willing to look into having it indexed if it seems like it would greatly increase the speed of what I'm trying to do.

Thanks in advance for any feedback and ideas,

Zach Dunlap
Business Product Analyst
Information Management
The Associated Press
_______________________________________________ General mailing list General@developer.marklogic.com http://developer.marklogic.com/mailman/listinfo/general