[MarkLogic Dev General] Optimal strategy for finding duplicate values in database

David Sewell Fri, 01 Feb 2013 18:44:48 -0800

Given a database with lots of files containing attribute values that are 
supposed to be unique across the database, is there an optimal 
MarkLogic-ish way to check for duplicates?


One traditional approach to finding non-distinct values performs 
terribly:

   let $values := collection()//my:element/@my:attribute
   for $value in distinct-values($values)
   return $value[count($values[. = $value]) > 1]

(where by "terribly" I mean on the order of 10 seconds elapsed time for 
5000 values on my system). Leveraging an element-attribute range index 
and running cts:search() on the distinct values was somewhat better, but 
not enough.
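
For concreteness, a rough sketch of what a range-index-plus-cts:search()
variant might look like. This is a reconstruction, not the exact query that
was timed; the namespace URI is a placeholder. Note that counting search
results counts matching fragments, so a value repeated inside a single
document would be missed:

    xquery version "1.0-ml";
    (: placeholder namespace URI; substitute the real one :)
    declare namespace my = "http://example.com/my";

    (: distinct values are read from the element-attribute range index :)
    for $value in cts:element-attribute-values(
                    xs:QName("my:element"), xs:QName("my:attribute"))
    (: one search per distinct value; each result is a matching fragment :)
    let $hits := cts:search(collection(),
                   cts:element-attribute-range-query(
                     xs:QName("my:element"), xs:QName("my:attribute"),
                     "=", $value))
    where count($hits) > 1
    return $value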

By far the most performant approach I have found is to iterate over a
sorted sequence of values, simulating a Unix "sort < file | uniq -d",
like so:

    let $ordered_values :=
       for $v in collection()//my:element/@my:attribute
       order by $v
       return $v
    for $val at $pos in $ordered_values
    return
      if ($val eq $ordered_values[$pos - 1])
      then $val
      else ()

where "performant" means around 0.5 seconds for 100000 values.
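
One difference from "uniq -d" is worth noting: the loop above returns a
value once for every repeat after its first occurrence, so a value that
appears three times is reported twice. A small self-contained sketch on
made-up literal values (not real data), with distinct-values() wrapped
around the loop so that each duplicate is reported once:

    let $ordered_values :=
       for $v in ("b", "a", "c", "a", "b", "a")
       order by $v
       return $v
    return
      (: without distinct-values() this yields "a" twice and "b" once :)
      distinct-values(
        for $val at $pos in $ordered_values
        where $val eq $ordered_values[$pos - 1]
        return $val
      )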

Is this the best approach? Given that the attributes in question are in
an element-attribute range index, is there another strategy worth
trying?
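
One candidate that relies only on the range index, offered as an untimed
sketch rather than a known answer: read the values from the lexicon with
cts:element-attribute-values() and keep those whose cts:frequency() is
greater than 1. The "item-frequency" option is meant to count occurrences
rather than fragments; the namespace URI is again a placeholder:

    xquery version "1.0-ml";
    (: placeholder namespace URI; substitute the real one :)
    declare namespace my = "http://example.com/my";

    for $value in cts:element-attribute-values(
                    xs:QName("my:element"), xs:QName("my:attribute"),
                    (), "item-frequency")
    (: cts:frequency() reads the count attached to each lexicon value :)
    where cts:frequency($value) > 1
    return $value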

David

-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
