[MarkLogic Dev General] RE: Text Updates Garbage Collection? (Neil Bradley)

Kelly Stirman Fri, 09 Oct 2009 02:58:00 -0700

Hi Neil,

Have you thought about using cts:highlight() to replace your string values? 
You construct a cts:or-query() of all the different values you'd like to 
replace:

let $q := cts:or-query(("Doc","ume","nt"))

Then you call cts:highlight() on the document. Normally you would use 
cts:highlight() to wrap a matching string in new markup for styling, such as a 
span tag, but you can use it to replace the matching string with whatever you 
want. Where cts:highlight() finds a match, it exposes some useful variables. 
One is $cts:queries, which holds the query that matched the current text. You 
can use this with a lookup document like so:

<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

For each match, you'll get back a cts:query, and you can use this to find 
matches in your replace node, and use the substitution string as the value for 
the third argument in cts:highlight():

let $doc :=
<doc>I have some text that includes the words Doc, ume, and nt.</doc>

let $replace :=
<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

let $q := cts:or-query(("Doc","ume","nt"))

return
cts:highlight($doc,$q,local:replace($cts:queries,$replace))

-->

<doc>I have some text that includes the words DOC, UME, and NT.</doc>
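
The local:replace() function called above is not shown in this message. Here 
is a minimal sketch of what it might look like. It assumes the or-query is 
built from plain strings, so that each entry in $cts:queries is a 
cts:word-query whose text can be read with cts:word-query-text(). The function 
body is a guess for illustration, not the original code:

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: look up the replacement for whichever query matched. :)
declare function local:replace(
  $queries as cts:query*,
  $replace as element(replace)
) as xs:string?
{
  (: Assumes every matching query is a cts:word-query built from a string. :)
  let $from := cts:word-query-text($queries[1])
  return $replace/item[@from = $from][1]/fn:string(.)
};

let $doc :=
  <doc>I have some text that includes the words Doc, ume, and nt.</doc>

let $replace :=
  <replace>
    <item from="Doc">DOC</item>
    <item from="ume">UME</item>
    <item from="nt">NT</item>
  </replace>

let $q := cts:or-query(("Doc", "ume", "nt"))

return
  cts:highlight($doc, $q, local:replace($cts:queries, $replace))
```

If no item in $replace corresponds to the matched query, this sketch returns 
the empty sequence, so a fallback to $cts:text (the matched text itself) may be 
worth adding.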

This can be extended with cts:reverse-query() to perform custom enrichment on 
XML. Rather than having one large or-query() for all the strings you might want 
to replace, you would store a document with your query and any other useful 
metadata you wish to associate with the query. For example, if you wanted to do 
some custom enrichment on drug names, you might have a series of documents like 
this:

<drug>
  <name type="commercial">Tylenol</name>
  <img type="commercial">/Thumbs/generic/acetamenophin.png</img>
  <name type="generic">Acetaminophen</name>
  <img type="generic">/Thumbs/generic/acetamenophin.png</img>
  <link>http://drugdictionary.com/drugid/j674ui832190</link>
  <query>{
    cts:or-query((
      cts:word-query("Tylenol", "case-insensitive"),
      cts:word-query("Acetaminophen", "case-insensitive")
    ))
  }</query>
</drug>
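
For the reverse query to find these documents, they have to be stored in the 
database first. Here is a sketch of loading one. It assumes a made-up URI 
(/drugs/tylenol.xml) and that the enclosed cts:or-query() is serialized to its 
XML form inside <query> when the element is constructed. Treat the details as 
illustrative:

```xquery
xquery version "1.0-ml";

(: The URI is hypothetical; any scheme that keeps one drug per document works. :)
xdmp:document-insert(
  "/drugs/tylenol.xml",
  <drug>
    <name type="commercial">Tylenol</name>
    <name type="generic">Acetaminophen</name>
    <link>http://drugdictionary.com/drugid/j674ui832190</link>
    <query>{
      cts:or-query((
        cts:word-query("Tylenol", "case-insensitive"),
        cts:word-query("Acetaminophen", "case-insensitive")
      ))
    }</query>
  </drug>
)
```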

And for each document you want to enrich, you would use the reverse indexes to 
see which drugs are in the document. This is a much easier approach to manage 
than an or-query() of thousands of drug names:

cts:search(doc(),cts:reverse-query($new-document))

This would return the matching query documents, and you can then retrieve the 
queries from these docs and pass them to cts:highlight(). Here's how you might 
do that:

let $drug-groups := cts:search(doc(),cts:reverse-query($doc))
let $query := cts:or-query((cts:query($drug-groups/drug/query/*)))
return
  cts:highlight($doc,$query,local:drug-enrich($cts:queries,$drug-groups))

In this case, instead of a single replace document, the new value is one of 
several pieces of metadata you store with each query. You can write your own 
function to build elaborate replacement markup. Here's a simple example for the 
drugs:

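declare function local:drug-enrich($query as cts:query, $drug-groups as node()*)
{
  (: the <name> element whose text matches the query that fired :)
  let $this-drug := $drug-groups/drug/name[cts:contains(., $query)]
  let $this-type := fn:data($this-drug/@type)
  let $other-type :=
    if ($this-type eq "commercial") then "generic" else "commercial"
  (: <img> and <link> are siblings of <name>, not attributes of it :)
  let $img := $this-drug/../img[@type eq $this-type]/text()
  let $link := $this-drug/../link/text()
  let $equivalent := $this-drug/../name[@type eq $other-type]/text()
  (: emits the stored name; pass $cts:text in as well to keep the matched text :)
  return
    <drug img="{$img}" link="{$link}">{fn:data($this-drug)} [{$equivalent}]</drug>
};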

Kelly

Hi,

I want to check if there is likely to be any problem with memory exhaustion
in the following scenario.

I will have text documents stored in a MarkLogic database that I want to
update using a large number of consecutive search/replaces, then finally
convert to XML.

It seems obvious to me that I could easily run out of memory if I adopt this
approach (and have hundreds of replaces applied to large text documents). In
this trivial example, I am simply converting the word "Document" to
"DOCUMENT" in three steps, which I would obviously do in one for real, but
just to show the method I originally considered...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := fn:replace($NewText1, "ume", "UME")

    let $NewText3 := fn:replace($NewText2, "nt", "NT")

    let $XML := xdmp:unquote($NewText3)

    return

      $XML

I am assuming that each variable contains a variant of the text document, so
memory will quickly become exhausted.

However, if I use xdmp:set(), would that solve the problem, because the
first variable content is being replaced, and the later variables have no
content at all?...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := xdmp:set($NewText1, fn:replace($NewText1, "ume",
"UME"))

    let $NewText3 := xdmp:set($NewText1, fn:replace($NewText1, "nt", "NT"))

    let $XML := xdmp:unquote($NewText1)

    return

      $XML

Or should I expect the old text to still be occupying memory (lack of
string garbage collection)?

Thanks,

Neil.

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of 
[email protected]
Sent: Friday, October 09, 2009 2:27 AM
To: [email protected]
Subject: General Digest, Vol 64, Issue 25

Today's Topics:

   1. Performance Meters http test configuration (Curtis Wilde)
   2. Re: Performance Meters http test configuration (Michael Blakeley)
   3. Re: Performance Meters http test  configuration (Curtis Wilde)
   4. To set threshold for search:search results (mano m)
   5. Text Updates Garbage Collection? (Neil Bradley)

----------------------------------------------------------------------

Message: 1
Date: Thu, 8 Oct 2009 16:06:24 -0600
From: Curtis Wilde <[email protected]>
Subject: [MarkLogic Dev General] Performance Meters http test
        configuration
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

The performance meters tutorial does a good job at explaining how to execute
xcc tests with performance meters, but it is less clear how an http test
should work. I've taken a stab at a very simple http test with no success:

<h:script xmlns:h="http://marklogic.com/xdmp/harness">
    <h:test>
        <h:name>login</h:name>
        <h:set-up/>
        <h:tear-down/>
        <h:comment-expected-result><![CDATA[<response status="AUTHENTICATED"/>]]>
        </h:comment-expected-result>
        <h:query><![CDATA[login?username=foo&password=bar]]></h:query>
    </h:test>
</h:script>

The test makes a restful call (login) to a service, which should
authenticate the specified user and receive the authenticated status message
reply, but this never succeeds. In the address bar of the browser the call
looks like:

http://myTestServer:8030/login?username=foo&password=bar

properties file:
checkResults=true
host=myTestServer
port=8030
isRandomTest=false
inputPath=../tests/httptests.xml
numThreads=1
shared=false
readSize=32768
recordResults=true
#reporter=XMLReporter
#outputPath=results.xml
reporter=CSVReporter
outputPath=../reports/
reportTime=true
reportPercentileDuration=95
reportStandardDeviation=true
testTime=0
testType=HTTP
testListClass=com.marklogic.performance.XMLFileTestList

Not sure what I'm doing wrong.

------------------------------

Message: 2
Date: Thu, 08 Oct 2009 15:46:25 -0700
From: Michael Blakeley <[email protected]>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed

Curtis,

Try testType=URI instead. The HTTP test type is more specialized: it
posts the <h:query> value to a special "/evaluate.xqy" service on the
target host. The idea with that test type is to evaluate arbitrary
XQuery expressions.

-- Mike

------------------------------

Message: 3
Date: Thu, 8 Oct 2009 18:01:18 -0600
From: Curtis Wilde <[email protected]>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <[email protected]>
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

Thanks for the guidance, but changing to URI is still unsuccessful.

Manually requesting authentication with the browser should return:
<response status="AUTHENTICATED"/>
but I still receive
<response status="NOT_AUTHENTICATED"/>
(http://mytestserver:8030/login?username=foo&password=bar)

This is not a problem with the service since currently any username/password
combo will authenticate on our test system.
I'll try to monitor the actual request through a proxy or something and see
if it's getting mangled.

------------------------------

Message: 4
Date: Thu, 8 Oct 2009 23:16:05 -0700 (PDT)
From: mano m <[email protected]>
Subject: [MarkLogic Dev General] To set threshold for search:search
        results
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi

In a search application, we are performing the following steps:

1. A constant value is set as the threshold. From the search response, get the 
total number of results and compare it with the threshold.

2. If the number of search results exceeds the threshold, display the search 
results.

3. Otherwise, perform the "Did You Mean?" search (spell check and auto 
correction using a dictionary) and display the result.

Please suggest whether there is a more efficient way to set the threshold than 
using a constant.

Regards,
Mano
