Re: [MarkLogic Dev General] Keyword matching strategy

matt.broekhuis Thu, 24 May 2012 14:53:19 -0700

Is there no way to make the cts:query treat these fragments as if they were 
separate docs?

It would be easier for me, since I will be pushing (overwriting) this single 
doc every day using XCC from a different environment. Otherwise I have to deal 
with deleting documents that are no longer "valid", etc.

Thanks for all your help again.

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Thursday, May 24, 2012 4:49 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Keyword matching strategy

Right, break that up into multiple documents, one per word-query. Otherwise the 
search on reverse-query will merely tell you whether or not the new document 
matches the entire or-query.

-- Mike

On 24 May 2012, at 14:05 , <[email protected]> wrote:

> My other doc looks like this. This is probably what I should be using:
> 
> <or-query xmlns="http://marklogic.com/cts">
>  <word-query>
>    <text>cows</text>
>  </word-query>
>  <word-query>
>    <text>tigers</text>
>  </word-query>
>  <word-query>
>    <text>bears</text>
>  </word-query>
>  <word-query>
>    <text>10 commandments</text>
>  </word-query>
>  <word-query>
>    <text>awesome</text>
>  </word-query>
>  <word-query>
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Broekhuis, Matt
> Sent: Thursday, May 24, 2012 4:04 PM
> To: [email protected]
> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
> 
> If I have one document with all the search terms, how would I do that?
> 
> 
> <keywordMLList xmlns="http://westlegaledcenter.com/MarkLogicSearch">
>  <keywordML>
>    <keywordId>1</keywordId>
>    <keywordText>cows</keywordText>
>  </keywordML>
>  <keywordML>
>    <keywordId>2</keywordId>
>    <keywordText>horsies</keywordText>
>  </keywordML>
>  <keywordML>
>    <keywordId>3</keywordId>
>    <keywordText>bears</keywordText>
>  </keywordML>
> 
> 
> I tried
> 
> return cts:search(
>   doc('http://someURI/keywordList'),
>   cts:reverse-query(text { doc('targetDocURI') }))
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Thursday, May 24, 2012 3:52 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
> 
> No, with the reverse-query approach you would instead use around 4000 
> separate query documents. This is what I used to generate fake terms for 
> testing:
> 
>    for $i in 1 to 4000
>    return xdmp:document-insert(
>      concat('vocabulary/', $i),
>      document { cts:word-query(xdmp:integer-to-hex($i)) })
> 
> I think you said you have multiple vocabularies? You might use different 
> directory prefixes for different vocabularies. Then you could and-query the 
> reverse-query with a directory-query term.
> 
> -- Mike
> 
> On 24 May 2012, at 13:42 , <[email protected]> wrote:
> 
>> I just got done with the cts:walk and it's only taking about 3 or 4 
>> seconds. Our documents are not extremely large.
>> 
>> I made a giant or-query as an XML document, and passed that in.
>> 
>> I would like to try out the reverse-query as well. One thing I'm not seeing 
>> right away: do I still need my big or-query?
>> 
>> Thank you!
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Michael 
>> Blakeley
>> Sent: Thursday, May 24, 2012 2:59 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
>> 
>> The cts:walk can take some time too, simply because the query is so large. 
>> My test took about 30-sec for a 100-kB XML document. This could be capped 
>> using xdmp:elapsed-time and cts:action. I also found that it could be 
>> reduced to about 8-sec by rebuilding the XML in a simpler form:
>> 
>>   element words {
>>     for $w in cts:tokenize($new-document)[. instance of cts:word]
>>     return element word { $w } }
>> 
>> Then I remembered the reverse-query feature. With the fast reverse-query 
>> index enabled, the lookup could be very efficient.
>> 
>> cts:search(
>>   xdmp:directory('vocabulary/', 'infinity'),
>>   cts:reverse-query($new-document))
>> 
>> Without the reverse-query index, this took about 10-sec for my test 
>> document. That can be cut to about 3-sec by using a simplified version of 
>> the document. So it was already faster than cts:walk.
>> 
>> cts:search(
>>   xdmp:directory('vocabulary/', 'infinity'),
>>   cts:reverse-query(text { $new-document }))
>> 
>> Enabling the reverse-query index, both versions were sub-second - in fact, 
>> less than 100-ms, although the text-node version was still 3x faster than 
>> the marked-up version. Anyway I think reverse-query is the most efficient 
>> approach, and enabling fast reverse-query searches makes it very fast.
>> 
>> -- Mike
>> 
>> On 24 May 2012, at 10:40 , Will Thompson wrote:
>> 
>>> Matt,
>>> 
>>> I thought of this solution before I saw Mike's post, but this *would* 
>>> require that the document be inserted first. It leverages the word lexicon, 
>>> so it should be fairly fast, although it still took a while when I tried 
>>> something similar using local content.
>>> 
>>> (for $w in
>>>   cts:words((), (),
>>>     cts:and-query((
>>>       cts:document-query($user-doc-uri),
>>>       cts:word-query(doc('terms.xml')//term/string()))))
>>> order by cts:frequency($w)
>>> return $w)[1 to 20]
>>> 
>>> -Will
>>> 
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf 
>>> [email protected]
>>> Sent: Thursday, May 24, 2012 9:05 AM
>>> To: [email protected]
>>> Subject: [MarkLogic Dev General] Keyword matching strategy
>>> 
>>> I have a requirement where the end user would like to add "tags" to 
>>> individual documents.
>>> 
>>> I'm maintaining a separate domain specific list of terms which I suggest to 
>>> the user as potential tags they can select to apply to the document.
>>> 
>>> This list of terms is around 4000 items long. And it will continue to grow.
>>> 
>>> What I want to do ->
>>> 
>>> 1. user creates a document
>>> 2. execute a search against that document with each of these 4000 terms
>>> 3. use results to suggest tags to the user that are already part of the 
>>> document, so they don't have to think of them on their own
>>> 
>>> I tried running search:search 4000 times against the one document. It just 
>>> timed out (which makes sense)
>>> 
>>> I know there has to be a better way to do this. Any suggestions?
>>> 
>>> Thanks!
>>> 
>>> Matt
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://community.marklogic.com/mailman/listinfo/general