A couple of thoughts …

Consider using cts:uris (assuming you have the URI lexicon enabled for your 
database).  This is a lower-level API than search:search and could get you 
better performance.  My guess is that search:search is likely using cts:search 
under the covers, but I don't know for sure, as I typically use the lower-level 
APIs (cts:search, cts:uris, etc.).  Those more familiar with search:search can 
elaborate on whether it is built on cts:search.
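
A minimal, untested sketch of the cts:uris idea, returning the URIs of the matching documents rather than the documents themselves. The namespace URI and the external variable $uploadedPIIs are assumptions made for illustration; substitute whatever your application actually uses.

```xquery
xquery version "1.0-ml";

(: Assumption: placeholder namespace URI; use the one bound to ce: in your documents. :)
declare namespace ce = "http://example.com/ns/ce";

(: Assumption: the caller supplies the uploaded PIIs, one string per PII. :)
declare variable $uploadedPIIs as xs:string* external;

(: cts:uris reads from the URI lexicon, so that lexicon must be enabled.
   The third argument restricts the result to URIs of documents matching the query. :)
cts:uris(
  (),
  (),
  cts:element-value-query(xs:QName("ce:pii"), $uploadedPIIs)
)
```

Because only URIs come back from the lexicon, no document has to be fetched to produce the result list.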

Continue to use cts:element-value-query, but I would consider breaking the list 
of 100,000 terms into chunks of 10,000 or some other more reasonable size.  For 
these smaller chunks of work, I would consider spawning them on the task server 
so that they could potentially be done in parallel.  Of course, try all 100,000 
first and see if you can meet your performance criterion (< 10s).
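
The chunking could be sketched roughly as follows (untested). The module path /match-piis.xqy and the variable name piis are hypothetical; that module would have to exist on the server, split the string it receives back into individual PIIs, run the cts:element-value-query for its chunk, and store its results somewhere, since in this form the spawning request does not wait for or receive them.

```xquery
xquery version "1.0-ml";

(: Assumption: the caller supplies the uploaded PIIs, one string per PII. :)
declare variable $uploadedPIIs as xs:string* external;

(: Chunk size suggested above; tune it against your own timings. :)
declare variable $chunk-size as xs:integer := 10000;

let $chunk-count :=
  xs:integer(fn:ceiling(fn:count($uploadedPIIs) div $chunk-size))
for $i in 1 to $chunk-count
let $chunk :=
  fn:subsequence($uploadedPIIs, ($i - 1) * $chunk-size + 1, $chunk-size)
return
  (: Each external variable takes a single value, so the chunk is passed
     as one newline-separated string (PIIs contain no whitespace).
     "/match-piis.xqy" and "piis" are hypothetical names. :)
  xdmp:spawn(
    "/match-piis.xqy",
    (xs:QName("piis"), fn:string-join($chunk, "&#10;"))
  )
```

With 100,000 PIIs this queues ten tasks; how many actually run in parallel depends on the number of task server threads configured.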

One last thought is that you might want to investigate creating a range index 
on ce:pii and using cts:element-range-query.  I'm not sure whether this will be 
faster than cts:element-value-query, but I seem to recall that range indexes are 
supposed to be kept in memory.  This is a fairly heavyweight solution, as there 
could be implications for your DB sizing and for your XML: the ce:pii element 
would need to be unique within each XML document (which is likely not the 
case), and I wouldn't recommend fields as a workaround in this situation.
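
For comparison, the range-index variant might look like this (untested). It assumes a string element range index has already been configured on ce:pii; the namespace URI and $uploadedPIIs are again placeholders for illustration.

```xquery
xquery version "1.0-ml";

(: Assumption: placeholder namespace URI; use the one bound to ce: in your documents. :)
declare namespace ce = "http://example.com/ns/ce";

(: Assumption: the caller supplies the uploaded PIIs, one string per PII. :)
declare variable $uploadedPIIs as xs:string* external;

(: Requires a string range index on ce:pii whose collation matches the query's.
   With the "=" operator and several values, a document matches if its
   ce:pii equals any one of them. :)
cts:uris(
  (),
  (),
  cts:element-range-query(xs:QName("ce:pii"), "=", $uploadedPIIs)
)
```

Timing this against the cts:element-value-query version on the same list of PIIs is the only reliable way to tell which is faster for your data.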

Darin.



From: Vijayasekar Padmanaban <[email protected]>
Reply-To: General MarkLogic Developer Discussion <[email protected]>
Date: Wed, 20 Jul 2011 13:28:05 +0530
To: General MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Search using 100k terms

Hi Jason,

Sorry for the confusion.

Please find below a snippet of the XML we have in the DB. (The DB has 10 
million XML documents.)

<ja:item-info>
<ja:jid>YMSG</ja:jid>
<ja:aid>0103883</ja:aid>
<ce:pii>S0011-3840(01)70009-3</ce:pii>
<ce:doi>10.1016/S0011-3840(01)70009-3</ce:doi>
<ce:copyright type="other" year="2001"/>
</ja:item-info>

The file we upload will contain the PIIs (which I referred to as terms in my 
earlier email), as shown below. There could be 100k PIIs in the file:
S0016-5085(68)70198-0
S0016-5085(68)70199-2
S0016-5085(68)70200-6
S0016-5085(68)70201-8
S0016-5085(68)70202-X
S0016-5085(68)70203-1
S0016-5085(68)70204-3
…..
..…

I need to identify the documents that match the PIIs in the file.

Currently we are using the search:search() API in our application, so I tried 
using the additional-query option of the search API, as shown below:
cts:element-value-query(xs:QName("ce:pii"), $uploadedPIIs)  (: $uploadedPIIs as xs:string* :)

But with this additional-query option, the search takes a long time to return 
results.

Is there a better way to do this? Please suggest.

Regards,
Vijay

From: [email protected] On Behalf Of Jason Hunter
Sent: Wednesday, July 20, 2011 12:32 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search using 100k terms

You say "the term" but you also say you have 300,000 terms.  So I'm confused.

You want to find documents that have all 300,000 terms?

Or for each term you want to find documents having just that term?  And you 
want to do that basic query 300,000 times across all terms in less than 10 
seconds?

-jh-

On Jul 19, 2011, at 11:13 PM, Vijayasekar Padmanaban wrote:


Hi Jason,

Thanks for your response.

My DB has 10 million documents in it. I need to identify the documents that 
contain the term.
I would expect the search to return results in less than 10 seconds.

Regards,
Vijay

From: [email protected] On Behalf Of Jason Hunter
Sent: Wednesday, July 20, 2011 11:33 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search using 100k terms

I'm a little unclear on what you're trying to do.

You want to take a list of 300,000 terms and identify which documents have each 
term?  Or do you only need to identify which terms are present in one or more 
documents and which terms aren't present anywhere?  Something else?

How long are you willing to wait for the answer?

-jh-

On Jul 19, 2011, at 10:45 PM, Vijayasekar Padmanaban wrote:



Hi All,

We have a use case to perform a search based on the contents of an uploaded 
file. The file would have a maximum of 100,000 terms in it. We need to validate 
the contents of the file against our repository contents and produce results. 
Our repository contains 10 million documents. Each term in the file needs to be 
validated against an element in the enhanced XML.

Below are the two approaches I tried:
1. Using search constraints
   a. Each search term is concatenated with the constraint, and the results 
      are joined using 'OR' as the delimiter, as shown below:
      "const:<term1> OR const:<term2> OR const:<term3> OR ….."
      This ended in a stack overflow error when the number of search terms 
      exceeded 1000.
2. Using element value query
   a. All the search terms are passed as text to cts:element-value-query, as 
      shown below:
      cts:element-value-query(<Qualifier-Name>, text as xs:string*)
      This worked well when the DB contained fewer documents, say 300,000. But 
      when used with a DB that has 10 million documents, it failed with 
      "Time limit exceeded".

Could you suggest the best possible approach to resolve this issue?

Thanks,
Vijay


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
