As always, Darin has lots of great ideas, and I recommend trying them. Given the nature of your data, though, I would try using the "exact" option to cts:element-value-query.
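
A minimal sketch of what that could look like. It assumes a placeholder namespace URI for the ce prefix (substitute the one your documents really use), and it uses two of the PIIs quoted further down the thread as a stand-in for the contents of the uploaded file:

```xquery
xquery version "1.0-ml";

(: Assumption: placeholder namespace URI; replace it with the URI
   that the ce prefix is bound to in your documents. :)
declare namespace ce = "http://example.com/ce-placeholder";

(: Stand-in for the PIIs read from the uploaded file. :)
let $uploaded-piis := (
  "S0016-5085(68)70198-0",
  "S0016-5085(68)70199-2"
)

(: "exact" is shorthand for case-sensitive, diacritic-sensitive,
   punctuation-sensitive, whitespace-sensitive, unstemmed and
   unwildcarded matching. :)
let $query :=
  cts:element-value-query(xs:QName("ce:pii"), $uploaded-piis, "exact")

(: cts:search returns the documents that match the query. :)
return cts:search(fn:doc(), $query)
```

The same cts:element-value-query call should also work as the additional query you are already passing to search:search, so you can try the option without restructuring your application.
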
I think the range index idea is really good; you can then use a range query (cts:element-range-query) with either search:search or cts:uris. I am not sure I would classify it as a "heavyweight" solution. It will use a little more memory, but I suspect it will be well worth it.

Let us know how it goes.

-Danny

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of McBeath, Darin W (ELS-STL)
Sent: Wednesday, July 20, 2011 7:28 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search using 100k terms

I would also make sure that the options you pass to cts:element-value-query include 'punctuation-insensitive' and 'case-insensitive'. I'm assuming that punctuation and case will not matter in your situation.

On 7/20/11 9:54 AM, "McBeath, Darin W (ELS-STL)" <[email protected]> wrote:

>A couple of thoughts …
>
>Consider using cts:uris (assuming you have the URI lexicon enabled for
>your content). This is a lower-level API than search:search and could get
>you better performance. My guess is that search:search is likely using
>cts:search under the covers. I don't know for sure, as I typically use
>the lower-level APIs (such as cts:search, cts:uris, etc.). Those more
>familiar with search:search can elaborate on whether cts:search is being
>used by search:search.
>
>Continue to use cts:element-value-query, but I would consider breaking
>the list of 100,000 terms into chunks of 10,000 or something a bit more
>reasonable. For these smaller chunks of work, I would consider spawning
>them on the task server so that they could potentially be done in
>parallel. Of course, try 100,000 first and see if you can meet your
>performance criteria (< 10s).
>
>One last thought is that you might want to investigate creating a range
>index on ce:pii and using cts:element-range-query. Not sure if this will
>be faster than cts:element-value-query … But I seem to recall that
>range indexes are supposed to be kept in memory.
>This is a fairly heavyweight solution, as there could be implications for
>your DB sizing and your XML: the ce:pii element would need to be unique
>within your XML document (which is likely not the case). I wouldn't
>recommend fields in this situation as a workaround.
>
>Darin.
>
>
>From: Vijayasekar Padmanaban <[email protected]>
>Reply-To: General MarkLogic Developer Discussion <[email protected]>
>Date: Wed, 20 Jul 2011 13:28:05 +0530
>To: General MarkLogic Developer Discussion <[email protected]>
>Subject: Re: [MarkLogic Dev General] Search using 100k terms
>
>Hi Jason,
>
>Sorry for the confusion.
>
>Please find below a snippet of the XML we have in the DB. (The DB has 10
>million XML documents.)
>
><ja:item-info>
><ja:jid>YMSG</ja:jid>
><ja:aid>0103883</ja:aid>
><ce:pii>S0011-3840(01)70009-3</ce:pii>
><ce:doi>10.1016/S0011-3840(01)70009-3</ce:doi>
><ce:copyright type="other" year="2001"/>
></ja:item-info>
>
>The file we upload will contain the PIIs (which I had referred to as
>terms in my earlier email) as shown below. (There could be 100k PIIs in
>the file.)
>S0016-5085(68)70198-0
>S0016-5085(68)70199-2
>S0016-5085(68)70200-6
>S0016-5085(68)70201-8
>S0016-5085(68)70202-X
>S0016-5085(68)70203-1
>S0016-5085(68)70204-3
>…
>
>I need to identify the documents that match the PIIs in the file.
>
>Currently we are using the search:search() API in our application. Hence I
>tried using the additional query option of the search API as shown below:
>cts:element-value-query(xs:QName("ce:pii"), $uploadedPIIs as xs:string*)
>
>But this additional query option is taking a long time to return results.
>
>So is there a better way to perform this? Please suggest.
>
>Regards,
>Vijay
>
>From: [email protected] [mailto:[email protected]] On Behalf Of Jason Hunter
>Sent: Wednesday, July 20, 2011 12:32 PM
>To: General MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Search using 100k terms
>
>You say "the term" but you also say you have 300,000 terms. So I'm
>confused.
>
>You want to find documents that have all 300,000 terms?
>
>Or for each term you want to find documents having just that term? And
>you want to do that basic query 300,000 times across all terms in less
>than 10 seconds?
>
>-jh-
>
>On Jul 19, 2011, at 11:13 PM, Vijayasekar Padmanaban wrote:
>
>Hi Jason,
>
>Thanks for your response.
>
>My DB has 10 million documents in it. I need to identify the
>documents which have the term.
>I would expect the search to return results in less than 10 seconds.
>
>Regards,
>Vijay
>
>From: [email protected] [mailto:[email protected]] On Behalf Of Jason Hunter
>Sent: Wednesday, July 20, 2011 11:33 AM
>To: General MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Search using 100k terms
>
>I'm a little unclear on what you're trying to do.
>
>You want to take a list of 300,000 terms and identify which documents
>have each term? Or do you only need to identify which terms are present
>in one or more documents and which terms aren't present anywhere?
>Something else?
>
>How long are you willing to wait for the answer?
>
>-jh-
>
>On Jul 19, 2011, at 10:45 PM, Vijayasekar Padmanaban wrote:
>
>Hi All,
>
>We have a use case to perform a search based on the contents uploaded as a
>file. The file would have a maximum of 100,000 terms in it. We need to
>validate the contents of the file against our repository contents and
>produce results. Our repository contains 10 million documents. Each term
>in the file needs to be validated against an element in the enhanced XML.
>
>Below are the two approaches I have tried:
>1. Using search constraints
>   a. Each search term is concatenated with the constraint, and the
>      results are joined using 'OR' as the delimiter, as shown below:
>      For e.g., "const:<term1> OR const:<term2> OR const:<term3> OR
>      const:<term3> OR …"
>      This ended in a stack overflow error when the number of search
>      terms exceeded 1000.
>2. Using element value query
>   a. All the search terms are passed as text to
>      cts:element-value-query, as shown below:
>      cts:element-value-query(<Qualifier-Name>, text as xs:string*)
>      This worked well when the DB contained fewer documents, say
>      300,000. But when used with a DB that has 10 million documents, it
>      failed with "Time limit exceeded".
>
>Could you suggest the best possible approach to resolve this issue?
>
>Thanks,
>Vijay

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
