Perfect solution. Many thanks (part of the old code that was using inefficient sequence comparisons and taking entire minutes is now going to run in about 0.2 seconds)!

David

On Fri, 1 Feb 2013, Ryan Dew wrote:

Since you have a range index I believe you can do something like this:
cts:element-attribute-values(xs:QName('my:element'), xs:QName('my:attribute'))[cts:frequency(.) gt 1]

You would need a second query to actually retrieve the documents with duplicate
ids, but that is still probably more efficient.
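
A minimal sketch of what that second query might look like, assuming the same element-attribute range index and the default string collation. The namespace URI bound to "my" is a made-up placeholder, and this is untested against the original poster's data:

```xquery
xquery version "1.0-ml";
(: Hypothetical namespace URI; substitute the real one for the "my" prefix. :)
declare namespace my = "http://example.com/my";

(: Values that the range index reports for more than one fragment. :)
let $dups :=
  cts:element-attribute-values(
    xs:QName("my:element"), xs:QName("my:attribute")
  )[cts:frequency(.) gt 1]
(: Second query: fetch the documents carrying any of those values. :)
return
  cts:search(
    collection(),
    cts:element-attribute-range-query(
      xs:QName("my:element"), xs:QName("my:attribute"), "=", $dups
    )
  )
```

One caveat: as far as I know cts:frequency counts fragments by default, so an id repeated twice inside a single document would still show a frequency of 1. If that case matters, the "item-frequency" option of cts:element-attribute-values may be worth checking.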

-Ryan Dew


On Fri, Feb 1, 2013 at 7:44 PM, David Sewell <[email protected]> wrote:
      Given a database with lots of files containing attribute values that are
      supposed to be unique across the database, is there an optimal
      MarkLogic-ish way to check for duplicates?

      One traditional approach to finding non-distinct values performs
      terribly:

         let $values := collection()//my:element/@my:attribute
         for $value in distinct-values($values)
         return $value[count($values[. = $value]) > 1]

      (where by "terribly" I mean on the order of 10 seconds elapsed time for
      5000 values on my system). Leveraging an element-attribute range index
      and running cts:search() on the distinct values was somewhat better, but
      not enough.

      By far the most performant approach I have found is to iterate over a
      sorted sequence of values, simulating a Unix "sort < file | uniq -d",
      like so:

          let $ordered_values :=
             for $v in collection()//my:element/@my:attribute
             order by $v
             return $v
          for $val at $pos in $ordered_values
          return
            if ($val eq $ordered_values[$pos - 1])
            then $val
            else ()

      where "performant" means around 0.5 seconds for 100000 values.

      Is this the best approach? Given that the attributes in question are in
      an element-attribute range index, is there another strategy worth
      trying?

      David

      --
      David Sewell, Editorial and Technical Manager
      ROTUNDA, The University of Virginia Press
      PO Box 400314, Charlottesville, VA 22904-4314 USA
      Email: [email protected]   Tel: +1 434 924 9973
      Web: http://rotunda.upress.virginia.edu/
      _______________________________________________
      General mailing list
      [email protected]
      http://developer.marklogic.com/mailman/listinfo/general





