RE: [MarkLogic Dev General] Re: Fwd: performance question

Helen Chen Fri, 06 Jul 2007 12:54:49 -0700

Hi Danny,

I like the first idea of using cts:element-query because I cannot change the
structure.  I'll try it.


Thanks for your help.

Helen

>>> "Danny Sokolsky" <[EMAIL PROTECTED]> 07/06/07 2:24 PM >>>
Hi Helen,

I don't believe there is a way to limit the lexicon search to a specific
XPath, but if your content is fragmented in such a way that the author
elements you want are all in the same fragment, then I think you can
make this work.  The $query option of cts:element-values (and -match)
will return only entries that exist in the fragment matching the query.
You can then constrain the lexicon lookup to a cts:element-query of a
parent element to the author element.  Depending on how your content is
structured, I think this would work (or if your content is not
structured this way, you might be able to add this structure).  For
example, if your content looks like this:

<article>
  <article-info>
    <author> ..... </author>
  </article-info>
  <article-content> .... </article-content>
</article>

Then you can set article-info as a fragment root, and your lexicon query
would look something like:

cts:element-values(xs:QName("author"), "", (), 
   cts:element-query(xs:QName("article-info"), 
                     cts:directory-query("/your/article/directory/")))

Another approach is to pre-process your content and add a uniquely named
element with the author names, then create the range index on that
element.

Good luck!
-Danny
 

-----Original Message-----
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Helen Chen
Sent: Friday, July 06, 2007 8:31 AM
To: [email protected] 
Cc: Helen Chen
Subject: [MarkLogic Dev General] Re: Fwd: performance question


Hi Danny,

I tried building some indexes, but I didn't use the
cts:element-value-match APIs. I'll try that.

I also have a problem with building the index. For example, we have author
elements in different places, but when I pull data I only need the
author element from one specific XPath. Is there any way to build the index
just for that XPath instead of for all the author elements? The remaining
author elements are not part of my search, and there are a lot of them, so
they can use a lot of disk space.

Thanks, Helen

----------------------------------------------------------
From dsokolsky at marklogic.com  Thu Jul  5 18:04:15 2007
From: dsokolsky at marklogic.com (Danny Sokolsky)
Date: Thu Jul  5 18:02:20 2007
Subject: [MarkLogic Dev General] performance question
In-Reply-To: <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>

Hi Helen,

Have you tried using value lexicons to do this?  Value lexicons use
range indexes and the cts:element-values and cts:element-value-match
APIs.  There are several ways you might accomplish this with lexicons.

For example, you could create 3 string range indexes, one on each of the
surname, fname, and midname elements.  This will likely be much faster
than computing the distinct values. You could use cts:element-value-match
to get the a*, b*, etc. functionality too.
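
As a rough sketch of the per-letter lookup, assuming a string range index
has already been configured on the surname element. The directory URI is
the one from the original question, and the <letter> wrapper element is
purely illustrative:

```xquery
(: Sketch only: assumes a string range index on the surname element. :)
for $letter in ("a", "b", "c")
return
  <letter name="{ $letter }">{
    (: The wildcard pattern is resolved from the lexicon, so no
       documents need to be loaded to list the matching surnames. :)
    for $surname in cts:element-value-match(
                      xs:QName("surname"),
                      fn:concat($letter, "*"),
                      (),
                      cts:directory-query("/journal/coden/vol/", "infinity"))
    return <surname>{ $surname }</surname>
  }</letter>
```

Whether "a*" also matches capitalized surnames depends on the collation of
the range index and on any case options passed to the call, so that is
worth checking against your own data.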

If you need to do the whole string-joined name combining the 3 elements
thing, you might be able to (depending on whether there is complexity to
the author element that you did not mention) build a lexicon (string
range index) on the author element.  This will give you the values
equivalent to doing an fn:data of the author element.  Your query for
the list of unique authors would then be something like:

cts:element-values(xs:QName("author"))

You could also use the 3.2 frequency feature to find how many of each
there are.  Lexicons are very cool. 
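
A minimal sketch of the frequency idea, assuming a string range index on
the author element and reusing the directory URI from the original
question. The count attribute is an illustrative name, not something from
the thread:

```xquery
(: Sketch only: assumes a string range index on the author element. :)
for $name in cts:element-values(
               xs:QName("author"), "", (),
               cts:directory-query("/journal/coden/vol/", "infinity"))
return
  (: cts:frequency is meant to be called on values that come straight
     from a lexicon function such as cts:element-values. :)
  <author count="{ cts:frequency($name) }">{ $name }</author>
```

With no frequency option given, the count should reflect the number of
fragments containing each value rather than the number of occurrences.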

-Danny

-----Original Message-----
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Helen Chen
Sent: Thursday, July 05, 2007 2:04 PM
To: [email protected] 
Subject: [MarkLogic Dev General] performance question


I'm trying to create an author index list for the articles in a whole
volume. We may have 20000 articles in one volume, and each
article has more than one author. The author name element structure is:

<author><surname>..</surname><fname>..</fname><midname>..</midname></author>
The number of midname elements can be 0 or more than 1.

Since this is the index, the result will include all the articles under
the volume directory, and the important step for me is to create a unique
author name list.  But I found that this takes a very long time:

   let $article := cts:search(/article,
                     cts:directory-query("/journal/coden/vol/", "infinity"))
   let $author := $article/front/authgrp/author

   for $surname in distinct-values($author/surname),
       $fname in distinct-values($author[surname = $surname]/fname),
       $midname in distinct-values(
         if (fn:exists($author/middlename))
         then fn:string-join(
                for $m in $author[surname = $surname and fname = $fname]/middlename
                return $m/text(),
                " ")
         else ()
       )
   return
     <author>
       <surname>{ $surname }</surname>
       <fname>{ $fname }</fname>
       {
         if (fn:empty($midname)) then ()
         else <midname>{ $midname }</midname>
       }
     </author>

So I thought I could break up the surnames by starting letter. I
did the following: I loop through the 26 letters to get the result. The
problem is that a single letter is fairly quick (still about 10
seconds), but all 26 letters somehow take about 8 minutes. That is
much better than the first solution, but it is still too long for me.

<result>
{
   let $article := cts:search(/article,
                     cts:directory-query("/journal/APPLAB/vol_89/", "infinity"))

   for $letter in ("a","b","c","d","e","f","g","h","i","j","k","l","m",
                   "n","o","p","q","r","s","t","u","v","w","x","y","z")
   let $author := $article/front/authgrp/author[
                    fn:starts-with(surname, $letter,
                                   "http://marklogic.com/collation//S2")]
   return
     for $surname in distinct-values($author/surname),
         $fname in distinct-values($author[surname = $surname]/fname),
         $midname in distinct-values(
           if (fn:exists($author/middlename))
           then fn:string-join(
                  for $m in $author[surname = $surname and fname = $fname]/middlename
                  return $m/text(),
                  " ")
           else ()
         )
     return
       <author>
         <surname>{ $surname }</surname>
         <fname>{ $fname }</fname>
         {
           if (fn:empty($midname)) then ()
           else <midname>{ $midname }</midname>
         }
       </author>
}
</result>


Does anyone have suggestions for how I should deal with this?

Also, another small problem: I would prefer no midname output when there
is no midname, but this code prints an empty midname node if no midname
exists. Can someone tell me how to avoid printing the empty midname?
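
One likely cause, offered as a guess from reading the code above:
fn:string-join over an empty sequence returns the zero-length string "",
which is still one item, so fn:empty($midname) is false and an empty
element is written. A minimal sketch of testing for the empty string
instead, using a made-up inline author (the sample names are
hypothetical, and the child element is called midname here to follow the
structure described at the top of this message):

```xquery
(: Hypothetical sample author with no midname children. :)
let $author :=
  <author><surname>Smith</surname><fname>Ann</fname></author>
(: Joining zero text nodes yields "", not the empty sequence. :)
let $midname := fn:string-join($author/midname/text(), " ")
return
  <author>
    <surname>{ fn:string($author/surname) }</surname>
    <fname>{ fn:string($author/fname) }</fname>
    {
      (: Compare against "" rather than calling fn:empty. :)
      if ($midname eq "") then ()
      else <midname>{ $midname }</midname>
    }
  </author>
```

For this sample the result should contain only the surname and fname
elements, with no midname element at all.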


Thanks, Helen
_______________________________________________
General mailing list
[email protected] 
http://xqzone.com/mailman/listinfo/general 

