RE: [MarkLogic Dev General] RE: Text Updates Garbage Collection? (Neil Bradley)

Neil Bradley Fri, 09 Oct 2009 03:07:50 -0700

Kelly,

Does that approach work with text documents?

Another issue is that, for reasons I do not want to expand on here, we want
to process one document at a time through the step discussed here, along with
the steps before and after it, so I am not sure what benefits this approach
has over the fn:replace() function. But it is certainly an interesting
alternative.

Neil.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Kelly Stirman
Sent: 09 October 2009 10:57
To: [email protected]
Subject: [MarkLogic Dev General] RE: Text Updates Garbage Collection? (Neil
Bradley)

Hi Neil,

Have you thought about using cts:highlight() to do the replacing of your
string values? You basically construct a cts:or-query(()) of all the
different values you'd like to replace:

let $q := cts:or-query(("Doc","ume","nt"))

Then you call cts:highlight() on the document. Normally you would use
cts:highlight() to replace a matching string with some new markup for style,
such as a span tag. It turns out you can use it to replace the matching
string with whatever you want. Where cts:highlight() finds a match, you have
some useful options. One is the $cts:queries variable, which returns the
matching query for the text that is matched. You can use this with a lookup
document like so:

<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

For each match, you'll get back a cts:query, and you can use this to find
matches in your replace node, and use the substitution string as the value
for the third argument in cts:highlight():

let $doc :=
<doc>I have some text that includes the words Doc, ume, and nt.</doc>

let $replace :=
<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

let $q := cts:or-query(("Doc","ume","nt"))

return
cts:highlight($doc,$q,local:replace($cts:queries,$replace))

-->

<doc>I have some text that includes the words DOC, UME, and NT.</doc>
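
The example above calls local:replace() without defining it. A minimal sketch
of what such a helper might look like follows. The function body is an
assumption, not the original implementation: it presumes that $cts:queries
yields the individual cts:word-query items from the or-query, and it builds
the or-query from the lookup table so the two stay in step.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: looks up the substitution text for the query that matched. :)
declare function local:replace(
  $queries as cts:query*,
  $replace as element(replace)
) as xs:string
{
  (: Assumption: the first matching query is a cts:word-query for one @from value. :)
  let $from := cts:word-query-text($queries[1])
  return fn:string(($replace/item[@from = $from])[1])
};

let $doc :=
  <doc>I have some text that includes the words Doc, ume, and nt.</doc>

let $replace :=
  <replace>
    <item from="Doc">DOC</item>
    <item from="ume">UME</item>
    <item from="nt">NT</item>
  </replace>

(: One word query per lookup entry, combined into a single or-query. :)
let $q := cts:or-query(
  for $from in $replace/item/@from
  return cts:word-query($from))

return
  cts:highlight($doc, $q, local:replace($cts:queries, $replace))
```

Word queries match whole word tokens, so this sketch would not rewrite "ume"
inside "Document" the way fn:replace() does.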

This can be extended with cts:reverse-query() to perform custom enrichment
on XML. Rather than having one large or-query() for all the strings you
might want to replace, you would store a document with your query and any
other useful metadata you wish to associate with the query. For example, if
you wanted to do some custom enrichment on drug names, you might have a
series of documents like this:

<drug>
  <name type="commercial">Tylenol</name>
  <img type="commercial">/Thumbs/generic/acetamenophin.png</img>
  <name type="generic">Acetamenophin</name>
  <img type="generic">/Thumbs/generic/acetamenophin.png</img>
  <link>http://drugdictionary.com/drugid/j674ui832190</link>

  <query>{cts:or-query((
    cts:word-query("Tylenol","case-insensitive"),
    cts:word-query("Acetamenophin","case-insensitive")))}</query>
</drug>

And for each document you want to enrich, you would use the reverse indexes
to see which drugs are in the document. This is a much easier approach to
manage than an or-query() of thousands of drug names:

cts:search(doc(),cts:reverse-query($new-document))

This would return the matching query documents, and you can then retrieve
the queries from these docs and pass them to cts:highlight(). Here's how you
might do that:

let $drug-groups := cts:search(doc(),cts:reverse-query($doc))
let $query := cts:or-query((cts:query($drug-groups/drug/query/*)))
return
  cts:highlight($doc,$query,local:drug-enrich($cts:queries,$drug-groups))

In this case, instead of a single replace document, the new value is one of
several pieces of metadata you store with each query. You can write your own
function to build elaborate replacement markup. Here's a simple example for
the drugs:

declare function local:drug-enrich($query as cts:query,$drug-groups as node()*){
  (: the <name> element whose text matches the query that fired :)
  let $this-drug := $drug-groups/drug/name[cts:contains(.,$query)]
  let $this-type := fn:data($this-drug/@type)
  let $other-type := if($this-type eq "commercial") then "generic" else "commercial"
  (: <img> and <link> are siblings of <name> in the drug document :)
  let $img := fn:data($this-drug/../img[@type eq $this-type])
  let $link := $this-drug/../link/text()
  let $equivalent := $this-drug/../name[@type eq $other-type]/text()
  return <drug img="{$img}" link="{$link}">{fn:string($this-drug)} [{$equivalent}]</drug>
};

Kelly

Hi,

I want to check if there is likely to be any problem with memory exhaustion
in the following scenario.

I will have text documents stored in a MarkLogic database that I want to
update using a large number of consecutive search/replaces, then finally
convert to XML.

It seems obvious to me that I could easily run out of memory if I adopt this
approach (and have hundreds of replaces applied to large text documents). In
this trivial example, I am simply converting the word "Document" to
"DOCUMENT" in three steps, which I would obviously do in one for real, but
just to show the method I originally considered...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := fn:replace($NewText1, "ume", "UME")

    let $NewText3 := fn:replace($NewText2, "nt", "NT")

    let $XML := xdmp:unquote($NewText3)

    return

      $XML

I am assuming that each variable contains a variant of the text document, so
memory will quickly become exhausted.

However, if I use xdmp:set(), would that solve the problem, because the
first variable content is being replaced, and the later variables have no
content at all?...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := xdmp:set($NewText1, fn:replace($NewText1, "ume",
"UME"))

    let $NewText3 := xdmp:set($NewText1, fn:replace($NewText1, "nt", "NT"))

    let $XML := xdmp:unquote($NewText1)

    return

      $XML

Or should I expect the old text to still be occupying memory (lack of
string garbage collection)?

Thanks,

Neil.
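
With hundreds of replaces, the chain of let clauses could instead be driven
from a lookup table. The following is only a sketch of that idea: the
function name local:replace-all and the <replace> table are hypothetical,
and it makes no claim about how MarkLogic manages the memory of the
intermediate strings.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: applies each from/to pair in turn with fn:replace(). :)
declare function local:replace-all(
  $text as xs:string,
  $items as element(item)*
) as xs:string
{
  if (fn:empty($items)) then $text
  else
    local:replace-all(
      (: @from is used as a regular expression, the element text as the replacement :)
      fn:replace($text, fn:string($items[1]/@from), fn:string($items[1])),
      fn:subsequence($items, 2))
};

let $replace :=
  <replace>
    <item from="Doc">DOC</item>
    <item from="ume">UME</item>
    <item from="nt">NT</item>
  </replace>

let $text := "A Document is a Document."

return
  (: returns "A DOCUMENT is a DOCUMENT." :)
  local:replace-all($text, $replace/item)
```

Each call holds only its own input and passes the result straight to the
next step, so no named variable is kept per replacement. The final string
could then be handed to xdmp:unquote() as in the examples above.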

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of
[email protected]
Sent: Friday, October 09, 2009 2:27 AM
To: [email protected]
Subject: General Digest, Vol 64, Issue 25

Send General mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://xqzone.com/mailman/listinfo/general
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of General digest..."

Today's Topics:

   1. Performance Meters http test configuration (Curtis Wilde)
   2. Re: Performance Meters http test configuration (Michael Blakeley)
   3. Re: Performance Meters http test  configuration (Curtis Wilde)
   4. To set threshold for search:search results (mano m)
   5. Text Updates Garbage Collection? (Neil Bradley)

----------------------------------------------------------------------

Message: 1
Date: Thu, 8 Oct 2009 16:06:24 -0600
From: Curtis Wilde <[email protected]>
Subject: [MarkLogic Dev General] Performance Meters http test
        configuration
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

The performance meters tutorial does a good job at explaining how to execute
xcc tests with performance meters, but it is less clear how an http test
should work. I've taken a stab at a very simple http test with no success:

<h:script xmlns:h="http://marklogic.com/xdmp/harness">
    <h:test>
        <h:name>login</h:name>
        <h:set-up/>
        <h:tear-down/>
        <h:comment-expected-result><![CDATA[<response
status="AUTHENTICATED"/>]]>
        </h:comment-expected-result>
        <h:query><![CDATA[login?username=foo&password=bar]]></h:query>
    </h:test>
</h:script>

The test makes a restful call (login) to a service, which should
authenticate the specified user and receive the authenticated status message
reply, but this never succeeds. In the address bar of the browser the call
looks like:

http://myTestServer:8030/login?username=foo&password=bar

properties file:
checkResults=true
host=myTestServer
port=8030
isRandomTest=false
inputPath=../tests/httptests.xml
numThreads=1
shared=false
readSize=32768
recordResults=true
#reporter=XMLReporter
#outputPath=results.xml
reporter=CSVReporter
outputPath=../reports/
reportTime=true
reportPercentileDuration=95
reportStandardDeviation=true
testTime=0
testType=HTTP
testListClass=com.marklogic.performance.XMLFileTestList

Not sure what I'm doing wrong.

------------------------------

Message: 2
Date: Thu, 08 Oct 2009 15:46:25 -0700
From: Michael Blakeley <[email protected]>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed

Curtis,

Try testType=URI instead. The HTTP test type is more specialized: it
posts the <h:query> value to a special "/evaluate.xqy" service on the
target host. The idea with that test type is to evaluate arbitrary
XQuery expressions.

-- Mike


------------------------------

Message: 3
Date: Thu, 8 Oct 2009 18:01:18 -0600
From: Curtis Wilde <[email protected]>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <[email protected]>
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

Thanks for the guidance, but changing to URI is still unsuccessful.

Manually requesting authentication with the browser should return:
<response status="AUTHENTICATED"/>
but I still receive
<response status="NOT_AUTHENTICATED"/>
(http://mytestserver:8030/login?username=foo&password=bar)

This is not a problem with the service since currently any username/password
combo will authenticate on our test system.
I'll try to monitor the actual request through a proxy or something and see
if it's getting mangled.


------------------------------

Message: 4
Date: Thu, 8 Oct 2009 23:16:05 -0700 (PDT)
From: mano m <[email protected]>
Subject: [MarkLogic Dev General] To set threshold for search:search
        results
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

In a search application, we are performing the following steps:

1. A constant value is set as the threshold. From the search response, we get
the total number of results and compare it with the threshold.

2. If the number of search results exceeds the threshold, we display the
search results.

3. Otherwise, we perform a "Did You Mean?" search (spell check and
auto-correction using a dictionary) and display the result.

Is there an efficient way to set the threshold other than using a
constant?

Regards,
Mano


------------------------------

Message: 5
Date: Fri, 9 Oct 2009 11:56:36 +0100
From: "Neil Bradley" <[email protected]>
Subject: [MarkLogic Dev General] Text Updates Garbage Collection?
To: <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="us-ascii"

Hi,

I want to check if there is likely to be any problem with memory exhaustion
in the following scenario.

I will have text documents stored in a MarkLogic database that I want to
update using a large number of consecutive search/replaces, then finally
convert to XML.

It seems obvious to me that I could easily run out of memory if I adopt this
approach (and have hundreds of replaces applied to large text documents). In
this trivial example, I am simply converting the word "Document" to
"DOCUMENT" in three steps, which I would obviously do in one for real, but
just to show the method I originally considered...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := fn:replace($NewText1, "ume", "UME")

    let $NewText3 := fn:replace($NewText2, "nt", "NT")

    let $XML := xdmp:unquote($NewText3)

    return

      $XML

I am assuming that each variable contains a variant of the text document, so
memory will quickly become exhausted.

However, if I use xdmp:set(), would that solve the problem, because the
first variable content is being replaced, and the later variables have no
content at all?...

    let $Text :=
".............................................................. (large text
document).............................."

    let $NewText1 := fn:replace($Text, "Doc", "DOC")

    let $NewText2 := xdmp:set($NewText1, fn:replace($NewText1, "ume",
"UME"))

    let $NewText3 := xdmp:set($NewText1, fn:replace($NewText1, "nt", "NT"))

    let $XML := xdmp:unquote($NewText1)

    return

      $XML

Or should I expect the old text to still be occupying memory (lack of
string garbage collection)?

Thanks,

Neil.


------------------------------

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general

End of General Digest, Vol 64, Issue 25
***************************************