Perfect solution. Many thanks (part of the old code that was using
inefficient sequence comparisons and taking entire minutes is now going
to run in about 0.2 seconds)!
David
On Fri, 1 Feb 2013, Ryan Dew wrote:
Since you have a range index I believe you can do something like this:
  cts:element-attribute-values(xs:QName('my:element'), xs:QName('my:attribute'))[cts:frequency(.) gt 1]
You would need a second query to actually retrieve the documents with
duplicate ids, but it is still probably more efficient.
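A rough sketch of how the two steps might fit together, not a tested
recipe. The namespace URI below is a placeholder (the thread never states
the real one), and the code assumes the range index on my:attribute uses a
type that the "=" operator can compare directly:

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; substitute the one bound to "my" in your data. :)
declare namespace my = "http://example.com/my";

(: Step 1: read the duplicated values straight from the range index. :)
let $dupes :=
  cts:element-attribute-values(
    xs:QName("my:element"), xs:QName("my:attribute")
  )[cts:frequency(.) gt 1]
return
  (: Guard against an empty value list before building the range query. :)
  if (fn:empty($dupes)) then ()
  else
    (: Step 2: find the documents carrying any of those values. :)
    for $doc in cts:search(
      fn:doc(),
      cts:element-attribute-range-query(
        xs:QName("my:element"), xs:QName("my:attribute"), "=", $dupes))
    return xdmp:node-uri($doc)
```

One caveat worth checking against the documentation: as far as I know,
cts:frequency counts fragments by default, so a value repeated only within
a single document may still report a frequency of 1. Passing the
"item-frequency" option to cts:element-attribute-values should count
individual occurrences instead.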
-Ryan Dew
On Fri, Feb 1, 2013 at 7:44 PM, David Sewell <[email protected]> wrote:
Given a database with lots of files containing attribute values that are
supposed to be unique across the database, is there an optimal
MarkLogic-ish way to check for duplicates?
One traditional approach to finding non-distinct values performs
terribly:
  let $values := collection()//my:element/@my:attribute
  for $value in distinct-values($values)
  return $value[count($values[. = $value]) > 1]
(where by "terribly" I mean on the order of 10 seconds elapsed time for
5000 values on my system). Leveraging an element-attribute range index
and running cts:search() on the distinct values was somewhat better, but
not enough.
By far the most performant approach I have found is to iterate over a
sorted sequence of values, simulating a Unix "sort < file | uniq -d",
like so:
  let $ordered_values :=
    for $v in collection()//my:element/@my:attribute
    order by $v
    return $v
  for $val at $pos in $ordered_values
  return
    if ($val eq $ordered_values[$pos - 1])
    then $val
    else ()
where "performant" means around 0.5 seconds for 100000 values.
Is this the best approach? Given that the attributes in question are in
an element-attribute range index, is there another strategy worth
trying?
David
--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected] Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general