Given a database with lots of files containing attribute values that are
supposed to be unique across the database, is there an optimal
MarkLogic-ish way to check for duplicates?
One traditional approach to finding non-distinct values performs
terribly:
    let $values := collection()//my:element/@my:attribute
    for $value in distinct-values($values)
    return $value[count($values[. = $value]) > 1]
(where by "terribly" I mean on the order of 10 seconds elapsed time for
5000 values on my system). Leveraging an element-attribute range index
and running cts:search() on the distinct values was somewhat better, but
still not fast enough.
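For reference, the range-index variant might look roughly like the
sketch below. The namespace URI is a placeholder, and the sketch assumes
an element-attribute range index already exists on
my:element/@my:attribute. It also counts matching fragments rather than
individual attributes, so a value repeated inside a single fragment
would not be reported.

```xquery
(: Placeholder namespace URI; substitute the real one. :)
declare namespace my = "http://example.com/my";

(: Pull the distinct values straight from the range index (lexicon). :)
for $v in cts:element-attribute-values(
            xs:QName("my:element"), xs:QName("my:attribute"))
(: Find the fragments that carry this exact attribute value. :)
let $hits := cts:search(
               collection(),
               cts:element-attribute-value-query(
                 xs:QName("my:element"), xs:QName("my:attribute"),
                 string($v)))
(: More than one matching fragment means the value is not unique. :)
where count($hits) gt 1
return $v
```

This runs one search per distinct value, which may be why it does not
scale much better than the first query.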
By far the most performant approach I have found is to iterate over a
sorted sequence of values, simulating a Unix "sort < file | uniq -d",
like so:
    let $ordered_values :=
      for $v in collection()//my:element/@my:attribute
      order by $v
      return $v
    for $val at $pos in $ordered_values
    return
      if ($val eq $ordered_values[$pos - 1])
      then $val
      else ()
where "performant" means around 0.5 seconds for 100000 values.
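One difference from "uniq -d" is worth noting: a value that occurs three
or more times is returned once for every adjacent match, not just once.
A minimal sketch of a variant that reports each duplicate a single time
follows. It uses a hard-coded sample sequence in place of the attribute
values so that it runs in any XQuery 1.0 processor; the sample data and
variable names are illustrative only.

```xquery
(: Hypothetical sample data standing in for the attribute values. :)
let $values := ("b", "a", "c", "a", "b", "a")
(: Sort so that equal values end up next to each other. :)
let $ordered :=
  for $v in $values
  order by $v
  return $v
return
  (: Collapse repeated hits so each duplicated value appears once. :)
  distinct-values(
    for $val at $pos in $ordered
    (: Compare each value with its predecessor; skip the first item. :)
    where $pos gt 1 and $val eq $ordered[$pos - 1]
    return $val
  )
```

With this sample input, "a" (three occurrences) and "b" (two
occurrences) are each reported once, and "c" is not reported. The extra
distinct-values() call only sees the duplicates, so its cost should be
small when duplicates are rare.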
Is this the best approach? Given that the attributes in question are in
an element-attribute range index, is there another strategy worth
trying?
David
--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected] Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/