Since you have a range index, I believe you can do something like this:

cts:element-attribute-values(
  xs:QName('my:element'),
  xs:QName('my:attribute')
)[cts:frequency(.) gt 1]

You would need a second query to actually retrieve the documents with
the duplicate ids, but this is still probably more efficient, since the
values and their frequencies come straight from the range index.
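
Here is a rough, untested sketch of how the two steps might fit together. The namespace URI is a placeholder I made up, and I am assuming the range index uses the default collation if it is a string index. I have also passed the "item-frequency" option, on the assumption that you want to count every occurrence: as far as I know the default ("fragment-frequency") counts fragments, so a value repeated twice inside one document would report a frequency of 1.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; substitute the one bound to "my" in your data. :)
declare namespace my = "http://example.com/my";

let $elem := xs:QName("my:element")
let $attr := xs:QName("my:attribute")

(: Step 1: read the values from the range index and keep those that
   occur more than once. "item-frequency" is an assumption about what
   should be counted (occurrences rather than fragments). :)
let $dups :=
  cts:element-attribute-values($elem, $attr, (), "item-frequency")
    [cts:frequency(.) gt 1]

(: Step 2: for each duplicated value, list the URIs of the documents
   that contain it, using a range query against the same index. :)
for $dup in $dups
return
  <duplicate value="{$dup}">{
    for $doc in cts:search(fn:collection(),
                  cts:element-attribute-range-query($elem, $attr, "=", $dup))
    return <uri>{xdmp:node-uri($doc)}</uri>
  }</duplicate>
```

If there are only a handful of duplicates, the second step should be cheap, because each cts:search() is resolved from the index rather than by walking the documents.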
-Ryan Dew
On Fri, Feb 1, 2013 at 7:44 PM, David Sewell <[email protected]> wrote:
> Given a database with lots of files containing attribute values that are
> supposed to be unique across the database, is there an optimal
> MarkLogic-ish way to check for duplicates?
>
> One traditional approach to finding non-distinct values performs
> terribly:
>
> let $values := collection()//my:element/@my:attribute
> for $value in distinct-values($values)
> return $value[count($values[. = $value]) > 1]
>
> (where by "terribly" I mean on the order of 10 seconds elapsed time for
> 5000 values on my system). Leveraging an element-attribute range index
> and running cts:search() on the distinct values was somewhat better, but
> not enough.
>
> By far the most performant approach I have found is to iterate over a
> sorted sequence of values, simulating a Unix "sort < file | uniq -d",
> like so:
>
> let $ordered_values :=
>   for $v in collection()//my:element/@my:attribute
>   order by $v
>   return $v
> for $val at $pos in $ordered_values
> return
>   if ($val eq $ordered_values[$pos - 1])
>   then $val
>   else ()
>
> where "performant" means around 0.5 seconds for 100000 values.
>
> Is this the best approach? Given that the attributes in question are in
> an element-attribute range index, is there another strategy worth
> trying?
>
> David
>
> --
> David Sewell, Editorial and Technical Manager
> ROTUNDA, The University of Virginia Press
> PO Box 400314, Charlottesville, VA 22904-4314 USA
> Email: [email protected] Tel: +1 434 924 9973
> Web: http://rotunda.upress.virginia.edu/
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>