There seems to be a typo or two in your test case, but it looks like you're
interested in identifying any documents in a specified directory that have
duplicate /doc/parent/child/@value values. Now, I'm not sure that the document
URI is really what you want to return. You haven't told us what the end goal
is, so you're almost certain to get sub-optimal suggestions. But you should be
able to figure it out.

Duplicate values are an area where value indexing can't do much for you,
because value indexing only records that a specific value is in the document.
But we can minimize the number of database lookups and make each one as
efficient as possible. In this query we go to the database just once, for all
the XML in the directory that has a 'doc' root element.

  xquery version "1.0-ml";
  declare variable $DIRECTORY as xs:string external;

  for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
  let $list := $doc/parent/child
  where (
    some $v in distinct-values($list/@value)
    satisfies count($list[@value eq $v]) gt 1)
  return xdmp:node-uri($doc)

This is still a "boil the ocean" approach, but it should be more efficient than
what you were doing before. If that's still too slow, try shifting the work to
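an element-attribute range index. For example:

  xquery version "1.0-ml";
  declare variable $DIRECTORY as xs:string external;

  (: Pair each document URI with each child/@value it contains. :)
  for $co in cts:value-co-occurrences(
    cts:uri-reference(),
    cts:element-attribute-reference(
      xs:QName('child'), xs:QName('value')),
    ('frequency-order', 'descending', 'item-frequency'),
    cts:directory-query($DIRECTORY, 'infinity'))
  (: A frequency above 1 means the value repeats within that document. :)
  where cts:frequency($co) gt 1
  return $co/cts:value[1]/string()

You might try running this without the where clause, returning $co, to see what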
it's doing. Read http://docs.marklogic.com/cts:value-co-occurrences for
documentation, and read up on any other functions you don't know about. This is
still boiling the ocean, but it's a smaller ocean because it's accessing the
element-attribute index and the URI lexicon directly. With a little more work
you might get it to evaluate lazily and short-circuit when the frequency drops
below 2. There are also games to be played with the 'map' option.
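
To make that inspection concrete, here is a minimal sketch of the same call
with the where clause dropped, so each URI/value pair is returned next to its
frequency. It assumes the URI lexicon is enabled and that a range index exists
on child/@value. The cap of 20 results is my own addition, there only to keep
the output readable.

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: Same lookup as the query above, but with no filtering. :)
let $pairs := cts:value-co-occurrences(
  cts:uri-reference(),
  cts:element-attribute-reference(
    xs:QName('child'), xs:QName('value')),
  ('frequency-order', 'descending', 'item-frequency'),
  cts:directory-query($DIRECTORY, 'infinity'))
(: Arbitrary cap of 20 pairs, an assumption for illustration only. :)
for $co in fn:subsequence($pairs, 1, 20)
return (cts:frequency($co), $co)
```

Each result should be a frequency followed by a cts:co-occurrence element
whose first cts:value is the document URI and whose second is the attribute
value. Pairs with a frequency of 1 are the ones the where clause filters out.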

If none of that is fast enough, try shifting the work to update time. Enrich
each document with an element or a collection that tells you whether or not it
has duplicate values. Then your query could become a simple cts:uris() lookup
with a cts:query that matches the "duplicate-values" marker.
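
As a rough sketch of that idea, assuming you use a collection named
"duplicate-values" as the marker: the first query below is a one-off pass that
tags existing documents, and the second is the cheap lookup you would run
afterwards. Run them as separate requests, since the collection change is not
visible until the update commits. In practice the same duplicate check would
also have to run wherever your documents are inserted or updated, and that
part is not shown here.

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: Tag every document whose child/@value values repeat. :)
for $doc in xdmp:directory($DIRECTORY, 'infinity')/doc
let $values := $doc/parent/child/@value
where count($values) gt count(distinct-values($values))
return xdmp:document-add-collections(
  xdmp:node-uri($doc), 'duplicate-values')
```

```xquery
xquery version "1.0-ml";
declare variable $DIRECTORY as xs:string external;

(: URI lexicon lookup: no documents are read from disk. :)
cts:uris((), (), cts:and-query((
  cts:directory-query($DIRECTORY, 'infinity'),
  cts:collection-query('duplicate-values'))))
```

The tagging pass still reads every document once, but it moves that cost out
of the query path. The lookup needs the URI lexicon to be enabled.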

-- Mike

On 18 Dec 2013, at 06:24, Nachiketa Kulkarni <[email protected]>
wrote:
> Hi,
>
> I need to get the URIs of a set of documents from the database with the below
> pattern:
>
> <doc>
>   <parent value="p1">
>     <child value="q1"/>
>     <child value="q1"/>
>     <child value="q1"/>
>     ….
>   </parent>
>   <parent value="p2">
>     <child value="q2"/>
>     <child value="q2"/>
>     <child value="q3"/>
>     <child value="q3"/>
>     ….
>   </parent>
>   ….
> </doc>
>
> The documents will have multiple <child/> entries with the same value
> attribute, and these may appear consecutively.
>
> For this, the XQuery below was written:
>
> fn:distinct-values(
>   (: directory is used to limit the search :)
>   for $x in cts:uri-match("(: directory name :)")
>   (: get the sequence of all parent elements from the document :)
>   for $a in doc($x)//parent
>   (: get the sequence of children from the parent :)
>   let $b := $a/child
>   (: iterate over the length of sequence :)
>   for $y in 1 to count($b)
>   where (
>     (: iterate from the next index to identify a similar child :)
>     for $z in $y+1 to count($b)
>     where ($b[$y]/@value=$b[$z]/@value)
>     return $x
>   ) != ""  (: if any such sequence is found, it won’t be empty :)
>   (: return the document URI :)
>   return $x)
>
> However, the above XQuery results in a time limit exceeded exception if the
> directory is big (and hence searching for such documents in the entire
> database is not possible).
>
> Please suggest an alternative way to make the search of such documents faster.
>
> N.B. Indexing is done for @value attribute of parent.
>
> Regards,
> Nachiketa
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general