Re: [MarkLogic Dev General] Keyword matching strategy

Michael Blakeley Thu, 24 May 2012 15:25:19 -0700

I think it would be cleaner to use a directory prefix in the document URIs with 
the date encoded in it, something like 'vocabulary/2012-05-24/'. You can insert 
the new queries into today's directory, run any tests, switch the production 
configuration to today's directory, and finally xdmp:directory-delete the old 
ones. You might even leave the old ones around for a day or two, in case you 
spot problems and need to roll back.
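
A rough sketch of that rollout, under some assumptions: the term list, the two dated directory names, and the one-query-document-per-term URI scheme below are placeholders for illustration, not anything taken from a real configuration. It uses only xdmp:document-insert, cts:word-query and xdmp:directory-delete, and the two statements run as separate transactions.

```xquery
xquery version "1.0-ml";

(: Step 1: load today's queries under a dated directory prefix.
   $terms is a hypothetical stand-in for the real vocabulary. :)
let $today := 'vocabulary/2012-05-24/'
let $terms := ('cows', 'tigers', 'bears')
for $term at $i in $terms
return xdmp:document-insert(
  concat($today, $i),
  document { cts:word-query($term) })
;

xquery version "1.0-ml";

(: Step 2: run this later, after production has been switched to
   today's directory and any rollback window has passed.
   The previous day's prefix is assumed here. :)
xdmp:directory-delete('vocabulary/2012-05-23/')
```

Between the two steps, the reverse-query search would take xdmp:directory('vocabulary/2012-05-24/', 'infinity') as its first argument, so switching production amounts to changing that one prefix.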


But I was curious about fragments so I looked into it. First, 
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/search-dev-guide/alerts.xml
 does not say anything about fragments. So the behavior is undocumented and 
might not be reliable, especially across releases. But this works for me with a 
fragment root on 'vocabulary-term'. For production use I think a namespace 
would be a good idea. I would *not* try to fragment on cts:word-query, since 
that is very likely to cause problems down the road.

xdmp:document-insert(
  'vocabulary/1',
  element vocabulary {
    for $i in 1 to 4000
    return element vocabulary-term {
      cts:word-query(xdmp:integer-to-hex($i)) }})
=> ()

cts:search(
  xdmp:directory('vocabulary/', 'infinity')//vocabulary-term,
  cts:reverse-query(text { 'caf' } ))  
=>
<vocabulary-term>
  <cts:word-query xmlns:cts="http://marklogic.com/cts">
    <cts:text xml:lang="en">caf</cts:text>
  </cts:word-query>
</vocabulary-term>

As far as I can tell from the query-meters output, this uses the reverse-query 
index. Again, the docs aren't clear on whether or not this should work, so it 
may not be supported for production use. I would check with support before 
relying on it.

This approach may also be a shade slower than the multidocument approach. If 
you don't have the reverse-query index, it could even be slower than cts:walk.

-- Mike

On 24 May 2012, at 14:53 , <[email protected]> wrote:

> There's no way to fake these fragments being separate docs to the cts:query?
> 
> It just makes it easier on me since I will be pushing (overwriting) this 
> single doc every day using XCC in a different environment. Otherwise I have 
> to deal with the issues of deleting documents that are no longer "valid", 
> etc.
> 
> Thanks for all your help again.
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Thursday, May 24, 2012 4:49 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
> 
> Right, break that up into multiple documents, one per word-query. Otherwise 
> the search on reverse-query will merely tell you whether or not the new 
> document matches the entire or-query.
> 
> -- Mike
> 
> On 24 May 2012, at 14:05 , <[email protected]> wrote:
> 
>> My other doc looks like this. Probably this is what I should be using:
>> 
>> <or-query xmlns="http://marklogic.com/cts">
>> <word-query>
>>   <text>cows</text>
>> </word-query>
>> <word-query>
>>   <text>tigers</text>
>> </word-query>
>> <word-query>
>>   <text>bears</text>
>> </word-query>
>> <word-query>
>>   <text>10 commandments</text>
>> </word-query>
>> <word-query>
>>   <text>awesome</text>
>> </word-query>
>> <word-query>
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Broekhuis, Matt
>> Sent: Thursday, May 24, 2012 4:04 PM
>> To: [email protected]
>> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
>> 
>> If I have one document with all the search terms, how would I do that?
>> 
>> 
>> <keywordMLList xmlns="http://westlegaledcenter.com/MarkLogicSearch">
>> <keywordML>
>>   <keywordId>1</keywordId>
>>   <keywordText>cows</keywordText>
>> </keywordML>
>> <keywordML>
>>   <keywordId>2</keywordId>
>>   <keywordText>horsies</keywordText>
>> </keywordML>
>> <keywordML>
>>   <keywordId>3</keywordId>
>>   <keywordText>bears</keywordText>
>> </keywordML>
>> 
>> 
>> I tried
>> 
>> return cts:search(doc('http://someURI/keywordList'), cts:reverse-query(text{ 
>> doc('targetDocURI')}))
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Michael 
>> Blakeley
>> Sent: Thursday, May 24, 2012 3:52 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
>> 
>> No, with the reverse-query approach you would instead use around 4000 
>> separate query documents. This is what I used to generate fake terms for 
>> testing:
>> 
>>   for $i in 1 to 4000
>>   return xdmp:document-insert(
>>     concat('vocabulary/', $i),
>>     document { cts:word-query(xdmp:integer-to-hex($i)) })
>> 
>> I think you said you have multiple vocabularies? You might use different 
>> directory prefixes for different vocabularies. Then you could and-query the 
>> reverse-query with a directory-query term.
>> 
>> -- Mike
>> 
>> On 24 May 2012, at 13:42 , <[email protected]> wrote:
>> 
>>> I just got done with the cts:walk and it's only taking about 3 or 4 seconds. 
>>> Our documents are not extremely large. 
>>> 
>>> I made a giant or query as an xml document, and passed that in. 
>>> 
>>> I would like to try out the reverse-query as well. One thing I'm not seeing 
>>> right away: do I still need my big OR-query?
>>> 
>>> Thank you !
>>> 
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Michael 
>>> Blakeley
>>> Sent: Thursday, May 24, 2012 2:59 PM
>>> To: MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Keyword matching strategy
>>> 
>>> The cts:walk can take some time too, simply because the query is so large. 
>>> My test took about 30-sec for a 100-kB XML document. This could be capped 
>>> using xdmp:elapsed-time and cts:action. I also found that it could be 
>>> reduced to about 8-sec by rebuilding the XML in a simpler form:
>>> 
>>>  element words {
>>>    for $w in cts:tokenize($new-document)[. instance of cts:word]
>>>    return element word { $w } }
>>> 
>>> Then I remembered the reverse-query feature. With the fast reverse-query 
>>> index enabled, the lookup could be very efficient.
>>> 
>>> cts:search(
>>>  xdmp:directory('vocabulary/', 'infinity'),
>>>  cts:reverse-query($new-document))
>>> 
>>> Without the reverse-query index, this took about 10-sec for my test 
>>> document. That can be cut to about 3-sec by using a simplified version of 
>>> the document. So it was already faster than cts:walk.
>>> 
>>> cts:search(
>>>  xdmp:directory('vocabulary/', 'infinity'),
>>>  cts:reverse-query(text { $new-document }))
>>> 
>>> Enabling the reverse-query index, both versions were sub-second - in fact, 
>>> less than 100-ms, although the text-node version was still 3x faster than 
>>> the marked-up version. Anyway I think reverse-query is the most efficient 
>>> approach, and enabling fast reverse-query searches makes it very fast.
>>> 
>>> -- Mike
>>> 
>>> On 24 May 2012, at 10:40 , Will Thompson wrote:
>>> 
>>>> Matt,
>>>> 
>>>> I thought of this solution before I saw Mike's post, but this *would* 
>>>> require that the document be inserted first. It leverages the word 
>>>> lexicon, so it should be fairly fast, although it still took a while when 
>>>> I tried something similar using local content.
>>>> 
>>>> (for $w in
>>>>   cts:words((), (),
>>>>     cts:and-query((
>>>>       cts:document-query($user-doc-uri),
>>>>       cts:word-query(doc('terms.xml')//term/string()))))
>>>> order by cts:frequency($w)
>>>> return $w)[1 to 20]
>>>> 
>>>> -Will
>>>> 
>>>> From: [email protected] 
>>>> [mailto:[email protected]] On Behalf 
>>>> [email protected]
>>>> Sent: Thursday, May 24, 2012 9:05 AM
>>>> To: [email protected]
>>>> Subject: [MarkLogic Dev General] Keyword matching strategy
>>>> 
>>>> I have a requirement where the end user would like to add "tags" to 
>>>> individual documents.
>>>> 
>>>> I'm maintaining a separate domain specific list of terms which I suggest 
>>>> to the user as potential tags they can select to apply to the document.
>>>> 
>>>> This list of terms is around 4000 items long. And it will continue to grow.
>>>> 
>>>> What I want to do ->
>>>> 
>>>> 1. user creates a document
>>>> 2. execute a search against that document with each of these 4000 terms
>>>> 3. use results to suggest tags to the user that are already part of the 
>>>> document, so they don't have to think of them on their own
>>>> 
>>>> I tried running search:search 4000 times against the one document. It just 
>>>> timed out (which makes sense).
>>>> 
>>>> I know there has to be a better way to do this. Any suggestions?
>>>> 
>>>> Thanks!
>>>> 
>>>> Matt
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://community.marklogic.com/mailman/listinfo/general
>>> 
