I have a case where I need to search our MarkLogic database for millions of 
GUIDs. Specifically, I have sitemaps loaded into a directory in MarkLogic 
broken into chunks of 100,000 values each that look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>000001c10ede4faaaf3b53ad0c0d4172.0</loc></url>
<url><loc>036501c10ede4faaaf3b73ad0c0d4168.0</loc></url>
<url><loc>098701c10ede4faaaf3b83ad0c0d4102.0</loc></url>
</urlset>

I have a total of about 44 million values to search for. Each value in <loc> 
may occur one or more times as the value of an element in one or more of about 
28.5 million XML documents in MarkLogic.

What I'd like to do is return a list of values that exist in the sitemaps but 
are not found in the database.

I have tried several approaches including (at a high level without going into 
detail):

*         Using straight cts:search()

*         Writing the values from the documents in the database to document 
properties and doing a cts:properties-query()

*         Using xdmp:exists

Here's a sample of one of the queries I tried using (the most successful when 
paired with the content processing framework):

xquery version "1.0-ml";
declare namespace nml="http://iptc.org/std/nar/2006-10-01/";
declare namespace sm="http://www.sitemaps.org/schemas/sitemap/0.9";

for $g in xdmp:directory("/sitemaps/", "infinity")//sm:loc
  let $uris := cts:element-value-query(xs:QName("nml:itemIdSequence"),
    xs:string($g))
  return
    if (xdmp:exists(cts:search(fn:doc()//nml:newsMessage, $uris, "unfiltered")))
    then ()
    else xdmp:node-insert-child(doc("/comparisons/missing-items.xml")/missing,
      <loc>{xs:string($g)}</loc>)

I've tried running this query by:

*         Using CORB

*         Creating a CPF scenario where I batched the sitemaps into smaller 
sitemaps of 250 <loc> items apiece and then kicked off the query above

*         Plugging this into QConsole and running it.

My experience has been:

*         CORB times out

*         CPF works on the smaller batches, but is very, very slow.

*         QConsole of course times out

Can anyone suggest other ways I might go about this?

I'm looking for a method that would take at most hours and not weeks (which is 
how long my CPF setup would take if I were to use it) to search for the 44 
million values.

The element value in the documents I'm searching against in the database is NOT 
indexed at this time, but I would be willing to look into having it indexed if 
it seems like it would greatly increase the speed of what I'm trying to do.
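
To make the indexed option concrete, here is a minimal, untested sketch of what 
I imagine a per-chunk comparison might look like. It ASSUMES a string range 
index on nml:itemIdSequence (which does not exist today, and whose collation 
would have to match the default), that the indexed values match the <loc> 
strings exactly, and a made-up chunk URI. I have not measured whether passing 
100,000 values to a single range query is practical.

```xquery
xquery version "1.0-ml";
declare namespace nml="http://iptc.org/std/nar/2006-10-01/";
declare namespace sm="http://www.sitemaps.org/schemas/sitemap/0.9";

(: Hypothetical URI of one 100,000-value sitemap chunk. :)
let $chunk-uri := "/sitemaps/sitemap-0001.xml"
let $wanted := fn:doc($chunk-uri)//sm:loc/fn:string(.)

(: Start by treating every sitemap value as missing. :)
let $missing := map:map()
let $_ := for $v in $wanted return map:put($missing, $v, fn:true())

(: ASSUMPTION: a string range index exists on nml:itemIdSequence.
   Pull the indexed values from fragments that match any wanted value.
   This may return extra values from the same fragments; those are
   harmless because deleting an absent key does nothing. :)
let $found := cts:element-values(
  xs:QName("nml:itemIdSequence"), (), (),
  cts:element-range-query(xs:QName("nml:itemIdSequence"), "=", $wanted))

(: Drop every value that was found; the remaining keys are the gaps. :)
let $_ := for $v in $found return map:delete($missing, $v)
return map:keys($missing)
```

The idea would be to answer each chunk from the index with one lookup instead 
of running one cts:search() per value, and to run one such query per sitemap 
document.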

Thanks in advance for any feedback and ideas,

Zach Dunlap
Business Product Analyst
Information Management
The Associated Press



