[MarkLogic Dev General] RE: Sorting by the number of occurences of a paragraph

Laurens van den Oever Wed, 29 Jul 2009 14:58:46 -0700

Hi Kelly,

I've found the solution to the "no element-attribute range index exists for
the given element/attribute QNames" exception I mentioned earlier. The
problem was caused by the third argument being an empty sequence instead of
an empty string:
  cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),
==> () <== ,"item-frequency",$q)
I'm too new to XQuery to understand why that would result in the exception I
got.
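
For reference, a sketch of the call that runs without the exception, with
the empty string as the third argument. Wrapping the search phrase in
cts:word-query is an assumption here, since the thread does not show how $q
is built:

  let $q := cts:word-query("search phrase")
  return cts:element-attribute-values(
    xs:QName("paragraph"), xs:QName("hash-id"),
    "",                  (: empty string as the start value, not () :)
    "item-frequency", $q)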


Now that I have the query running it looks like it doesn't do what I need it
to do. cts:element-attribute-values does return a list of hash-ids for
paragraphs, but the query searches within the entire fragment. So the result
is all hash-ids in all fragments that match the query instead of the
hash-ids of the paragraphs that contain $q. I'm not entirely sure how to
proceed from here.

I have learned how to use the hash-ids to get a very fast frequency count
for a given id:
  cts:frequency(cts:element-attribute-value-match(xs:QName("paragraph"),
xs:QName("hash-id"), $id, ("item-frequency")))
And to get the paragraphs including mixed content:
  (//paragraph[@hash-id eq $id])[1]

Now if I add an element word lexicon, should I be able to quickly get a list
of unique ids of paragraphs that contain my search query using the following
query?
fn:distinct-values(cts:search(//paragraph,
cts:element-word-query(xs:QName("paragraph"), $query))/@hash-id)

I assume this performs better than what I started with because it works with
the ids, reducing the amount of data to work with. What I started with was:
fn:distinct-values(cts:search(//paragraph, cts:field-word-query("paragraph",
$query), ("score-simple")))

Any other things I can try?

Thanks,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

Date: Mon, 27 Jul 2009 10:34:34 -0700
From: Kelly Stirman <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of
      occurences of   a paragraph
To: "[email protected]"
      <[email protected]>

Hi Laurens,

If I follow your design correctly, what I would do is the following:

1) iterate over all your paragraphs and use xdmp:md5() to generate a hash
value
2) add this hash value as an attribute to each paragraph, e.g. <paragraph
hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the
paragraph/@hash-id attribute
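
Steps 1 and 2 might be sketched as below. This is only an illustration under
some assumptions: hashing fn:string($p) ignores inline markup, so paragraphs
that differ only in markup would get the same id, and a large database would
probably need the update split into batches. The range index from step 3 is
set up in the database configuration and is not shown.

  xquery version "1.0-ml";
  (: add a hash-id attribute to every paragraph that does not have one yet :)
  for $p in //paragraph[fn:not(@hash-id)]
  return xdmp:node-insert-child($p,
    attribute hash-id { xdmp:md5(fn:string($p)) })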

Then to return paragraphs in frequency order, you can call
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),"item-frequency").
You can filter this list with any search expression by passing a cts:query
as an additional argument (see below).

This approach allows you to quickly get the hash-id in frequency order, with
or without a cts:query. You'll then need to go get a paragraph that matches
the hash-id. Because there may be many, you can simply grab the first.


let $q := cts:word-query("search phrase")
for $id in
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),"item-frequency",$q)
return element result {attribute count
{cts:frequency($id)},(//paragraph[@hash-id eq $id])[1]}
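
A variation worth trying if the ids come back in value order rather than by
count: as far as I know, "item-frequency" only controls how cts:frequency
counts, and the ordering is requested separately with "frequency-order". The
sketch below assumes that option is available in your server version, and it
uses an empty string as the start value:

  let $q := cts:word-query("search phrase")
  for $id in cts:element-attribute-values(
      xs:QName("paragraph"), xs:QName("hash-id"), "",
      ("item-frequency", "frequency-order"),  (: count items, most frequent first :)
      $q)
  return element result {
    attribute count { cts:frequency($id) },
    (//paragraph[@hash-id eq $id])[1]         (: any one paragraph with this id :)
  }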

Finally, before doing any of this, I would get rid of your fragmentation.
You probably don't need fields, but we can continue to talk about how they
might be useful for this task. I also don't think you need to limit to a
specific language, but that shouldn't slow things down if you want to use it
(be sure to look over our developer guide on using languages, and your
server license *may* come into play on this subject).

This should be very fast - well under a second as long as there aren't too
many paragraphs being returned. Getting the hash-ids will be resolved out of
the indexes, whereas each paragraph returned will incur a disk i/o. 100 or
so results should be sub-second.
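
One way to keep the number of paragraph fetches bounded is to cap the id list
before looking anything up. A sketch, with 100 as an arbitrary cutoff and an
empty string as the start value:

  for $id in cts:element-attribute-values(
      xs:QName("paragraph"), xs:QName("hash-id"), "",
      "item-frequency")[1 to 100]             (: keep only the first 100 ids :)
  return element result {
    attribute count { cts:frequency($id) },   (: answered from the index :)
    (//paragraph[@hash-id eq $id])[1]         (: one fetch per result :)
  }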

Kelly

