Re: [MarkLogic Dev General] QUERY to search particular docs

Nachiketa Kulkarni Fri, 20 Dec 2013 05:42:19 -0800

Mike,

Thanks for your response. The first query is faster than the one I was using. 
However, the XQuery still times out for directories with an exceptionally large 
number of documents (100,000). To avoid this, I am calling 
xdmp:set-request-time-limit() in the query, setting the limit to the maximum 
of 3600 seconds.
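
For reference, this is roughly how the call sits in front of the first query. 
It is only a sketch: it assumes the call is the first expression in the main 
module, and that 3600 does not exceed the max time limit configured on the app 
server.

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: Raise the time limit for this request only; the value is in seconds.
   3600 is assumed to be within the app server's configured maximum. :)
xdmp:set-request-time-limit(3600),

(: The first query from the reply quoted below, unchanged. :)
for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
let $list := $doc/parent/child
where (
  some $v in distinct-values($list/@value)
  satisfies count($list[@value eq $v]) gt 1)
return xdmp:node-uri($doc)
```

This only buys more time for the same amount of work; it does not reduce the 
work itself.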


Regards,
Nachiketa

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Wednesday, December 18, 2013 11:30 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] QUERY to search particular docs

There seems to be a typo or two in your test case, but it looks like you're 
interested in identifying any documents in a specified directory that have 
duplicate /doc/parent/child/@value values. Now, I'm not sure that the document 
URI is really what you want to return. You haven't told us what the end goal 
is, so you're almost certain to get sub-optimal suggestions. But you should be 
able to figure it out.

Duplicate values are an area where value indexing can't do much for you, 
because value indexing only records that a specific value is in the document. 
But we can minimize the number of database lookups and make each one as 
efficient as possible. In this query we go to the database just once, for all 
the XML in the directory that has a 'doc' root element.

    xquery version "1.0-ml";
    declare variable $DIRECTORY as xs:string external ;
    
    for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
    let $list := $doc/parent/child
    where (
      some $v in distinct-values($list/@value)
      satisfies count($list[@value eq $v]) gt 1)
    return xdmp:node-uri($doc)

This is still a "boil the ocean" approach, but it should be more efficient than 
what you were doing before. If that's still too slow, try shifting the work to 
an element-attribute range index. For example:

    for $co in cts:value-co-occurrences(
      cts:uri-reference(),
      cts:element-attribute-reference(
        xs:QName('child'), xs:QName('value')),
      ('frequency-order', 'descending', 'item-frequency'),
      cts:directory-query($DIRECTORY, 'infinity'))
    where cts:frequency($co) gt 1
    return $co/cts:value[1]/string()

You might try running this without the where clause and return $co, to see what 
it's doing. Read http://docs.marklogic.com/cts:value-co-occurrences for 
documentation, and read up on any other functions you don't know about. This is 
still boiling the ocean, but it's a smaller ocean because it's accessing the 
element-attribute index and the URI lexicon directly. With a little more work 
you might get it to evaluate lazily and short-circuit when the frequency drops 
below 2. There are also games to be played with the 'map' option.
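
To see what the co-occurrence call returns before filtering, something like 
the following might help. It is a sketch under assumptions: it needs the URI 
lexicon enabled and an element-attribute range index on child/@value (the 
original question only mentions an index on parent), and the cut-off of 10 
rows is arbitrary.

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: Inspect the first few co-occurrences without the where clause.
   Each output line is: document URI | child/@value | frequency.
   The limit of 10 is an arbitrary choice for illustration. :)
for $co in fn:subsequence(
  cts:value-co-occurrences(
    cts:uri-reference(),
    cts:element-attribute-reference(
      xs:QName('child'), xs:QName('value')),
    ('frequency-order', 'descending', 'item-frequency'),
    cts:directory-query($DIRECTORY, 'infinity')),
  1, 10)
return fn:string-join(
  ($co/cts:value[1]/string(),
   $co/cts:value[2]/string(),
   xs:string(cts:frequency($co))), ' | ')
```

Because the results come back in descending frequency order, any pairs with a 
frequency above 1 should appear at the top of this listing.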

If none of that is fast enough, try shifting the work to update time. Enrich 
each document with an element or a collection that tells you whether or not it 
has duplicate values. Then your query could become a simple cts:uris() lookup 
with a cts:query that matches the "duplicate-values" marker.
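
A minimal sketch of the collection variant, assuming a marker collection named 
"duplicate-values" and a hypothetical helper local:mark() that your load or 
update code would call for each document URI. Neither name comes from an 
existing API, and the duplicate test is per document, as in the first query 
above.

```xquery
xquery version "1.0-ml";

(: Assumed marker collection name, for illustration only. :)
declare variable $MARKER as xs:string := "duplicate-values";

(: True if any child/@value occurs more than once in the document. :)
declare function local:has-duplicates($doc as element(doc)?) as xs:boolean
{
  let $values := $doc/parent/child/@value/string()
  return count($values) gt count(distinct-values($values))
};

(: Hypothetical helper: add or remove the marker for one existing document. :)
declare function local:mark($uri as xs:string) as empty-sequence()
{
  if (local:has-duplicates(doc($uri)/doc))
  then xdmp:document-add-collections($uri, $MARKER)
  else xdmp:document-remove-collections($uri, $MARKER)
};

(: Placeholder URI; replace with the document being inserted or updated. :)
local:mark("/example/doc1.xml")
```

With the marker in place, the lookup could be a separate query along these 
lines (it requires the URI lexicon):

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: Resolved from the URI lexicon and indexes; no documents are read. :)
cts:uris((), (),
  cts:and-query((
    cts:directory-query($DIRECTORY, "infinity"),
    cts:collection-query("duplicate-values"))))
```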

-- Mike

On 18 Dec 2013, at 06:24 , Nachiketa Kulkarni <[email protected]> 
wrote:

> Hi,
>  
> I need to get the URIs of a set of documents from the database with the below 
> pattern:
>  
> <doc>
>    <parent value="p1">
>         <child value="q1"/>
>         <child value="q1"/>
>         <child value="q1"/>
>                 ....
>    </parent>
>    <parent value="p2">
>         <child value="q2"/>
>         <child value="q2"/>
>         <child value="q3"/>
>         <child value="q3"/>
>                 ....
>    </parent>
>     ....
> </doc>
>  
> Each document has multiple <child/> entries with the same value attribute, 
> and these may appear consecutively.
>  
> For this, I wrote the XQuery below:
>  
> fn:distinct-values(
>   (: the directory is used to limit the search :)
>   for $x in cts:uri-match("(: directory name :)")
>   (: get the sequence of all parent elements from the document :)
>   for $a in doc($x)//parent
>   (: get the sequence of children from the parent :)
>   let $b := $a/child
>   (: iterate over the length of the sequence :)
>   for $y in 1 to count($b)
>   (: iterate from the next index to identify a similar child;
>      if any such sequence is found, it won't be empty :)
>   where (for $z in $y+1 to count($b)
>          where ($b[$y]/@value=$b[$z]/@value)
>          return $x) != ""
>   (: return the document URI :)
>   return $x)
>  
> However, the above XQuery fails with a time-limit-exceeded exception if the 
> directory is large (and hence searching for such documents across the entire 
> database is not possible).
>  
> Please suggest an alternative way to make the search of such documents faster.
>  
> N.B. The @value attribute of parent is indexed.
>  
> Regards,
> Nachiketa
>  
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
