Mike,

Thanks for your response. The first query is faster than the one I was using. However, the XQuery time-out still occurs for directories with an exceptionally large number of docs (100,000). To avoid this, I am calling xdmp:set-request-time-limit() in the query, with the limit set to the maximum, i.e. 3600s.
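A minimal sketch of that workaround, assuming it is applied to the first query from the message quoted below and that the app server's "max time limit" is 3600 seconds (the request limit is assumed not to be settable beyond that value):

```xquery
xquery version "1.0-ml";

declare variable $DIRECTORY as xs:string external;

(: Raise the time limit for this request only, in seconds.
   Assumption: 3600 is also the app server's max time limit. :)
xdmp:set-request-time-limit(3600),

(: The first query from the quoted message, unchanged: return the URI
   of every 'doc' document whose child/@value values contain a duplicate. :)
for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
let $list := $doc/parent/child
where (
  some $v in distinct-values($list/@value)
  satisfies count($list[@value eq $v]) gt 1)
return xdmp:node-uri($doc)
```

This only postpones the time-out; the amount of work per document stays the same.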
Regards,
Nachiketa

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Wednesday, December 18, 2013 11:30 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] QUERY to search particular docs

There seems to be a typo or two in your test case, but it looks like you're interested in identifying any documents in a specified directory that have duplicate /doc/parent/child/@value values.

Now, I'm not sure that the document URI is really what you want to return. You haven't told us what the end goal is, so you're almost certain to get sub-optimal suggestions. But you should be able to figure it out.

Duplicate values are an area where value indexing can't do much for you, because value indexing only records that a specific value is in the document. But we can minimize the number of database lookups and make each one as efficient as possible. In this query we go to the database just once, for all the XML in the directory that has a 'doc' root element.

xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external ;
for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
let $list := $doc/parent/child
where (
  some $v in distinct-values($list/@value)
  satisfies count($list[@value eq $v]) gt 1)
return xdmp:node-uri($doc)

This is still a "boil the ocean" approach, but it should be more efficient than what you were doing before. If that's still too slow, try shifting the work to an element-attribute range index. For example:

for $co in cts:value-co-occurrences(
  cts:uri-reference(),
  cts:element-attribute-reference(
    xs:QName('child'), xs:QName('value')),
  ('frequency-order', 'descending', 'item-frequency'),
  cts:directory-query($DIRECTORY, 'infinity'))
where cts:frequency($co) gt 1
return $co/cts:value[1]/string()

You might try running this without the where clause and return $co, to see what it's doing.
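The variant without the where clause could look like the sketch below. It assumes the URI lexicon is enabled and that an element-attribute range index exists on child/@value (the original post only mentions an index on the @value attribute of parent). The [1 to 20] cap is an addition for illustration, to keep the output small on a large directory.

```xquery
xquery version "1.0-ml";

declare variable $DIRECTORY as xs:string external;

(: Assumptions: URI lexicon enabled; element-attribute range index
   configured on child/@value. :)
(
  for $co in cts:value-co-occurrences(
    cts:uri-reference(),
    cts:element-attribute-reference(
      xs:QName('child'), xs:QName('value')),
    ('frequency-order', 'descending', 'item-frequency'),
    cts:directory-query($DIRECTORY, 'infinity'))
  (: No where clause: return each co-occurrence element as-is, so the
     (URI, value) pairs can be inspected; cts:frequency($co) gives the
     frequency of a pair. :)
  return $co
)[1 to 20]  (: illustrative cap, not part of the original query :)
```

Each returned cts:co-occurrence element holds two cts:value children, the document URI first and the attribute value second, which is why the filtered query returns $co/cts:value[1].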
Read http://docs.marklogic.com/cts:value-co-occurrences for documentation, and read up on any other functions you don't know about.

This is still boiling the ocean, but it's a smaller ocean because it's accessing the element-attribute index and the URI lexicon directly. With a little more work you might get it to evaluate lazily and short-circuit when the frequency drops below 2. There are also games to be played with the 'map' option.

If none of that is fast enough, try shifting the work to update time. Enrich each document with an element or a collection that tells you whether or not it has duplicate values. Then your query could become a simple cts:uris() lookup with a cts:query that matches the "duplicate-values" marker.

-- Mike

On 18 Dec 2013, at 06:24 , Nachiketa Kulkarni <[email protected]> wrote:

> Hi,
>
> I need to get the URIs of a set of documents from the database with the below
> pattern:
>
> <doc>
>   <parent value="p1">
>     <child value="q1"/>
>     <child value="q1"/>
>     <child value="q1"/>
>     ....
>   </parent>
>   <parent value="p2">
>     <child value="q2"/>
>     <child value="q2"/>
>     <child value="q3"/>
>     <child value="q3"/>
>     ....
>   </parent>
>   ....
> </doc>
>
> The documents will have multiple entries of <child/> with the same value
> attribute and may appear consecutively.
>
> For this, the below XQuery is written:
>
> fn:distinct-values(
>   (: directory is used to limit the search :)
>   for $x in cts:uri-match("(: directory name :)")
>   (: get the sequence of all parent elements from the document :)
>   for $a in doc($x)//parent
>   (: get the sequence of children from the parent :)
>   let $b:=$a/child
>   (: iterate over the length of sequence :)
>   for $y in 1 to count($b)
>   (: iterate from the next index to identify a similar child;
>      if any such sequence is found, it won't be empty :)
>   where (for $z in $y+1 to count($b)
>          where ($b[$y]/@value=$b[$z]/@value)
>          return $x) != ""
>   (: return the document URI :)
>   return $x)
>
> However, the above XQuery results in a time-limit-exceeded exception if the
> directory size is big (and hence searching such documents in the entire
> database is not possible).
>
> Please suggest an alternative way to make the search of such documents faster.
>
> N.B. Indexing is done for @value attribute of parent.
>
> Regards,
> Nachiketa
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
