Re: Reverse search

karl wettin Thu, 22 Mar 2007 17:35:44 -0800


On 23 Mar 2007, at 02:12, Melanie Langlois wrote:

I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) in the Lucene directory, and whenever I receive a new document, I will search all the matching subscriptions to send the document to all subscribers. For instance, if a user subscribes to all documents with text containing (WORD1 and WORD2) or WORD3, how can I match the incoming document based on stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.
- OR: I will tokenize the document field content, insert OR between each of the tokens, and run the query against the OR condition of the subscription.
- It's for the AND that I will have an issue, because the incoming text may contain more words than the sequence I want to search.
For instance, if I subscribe to documents whose content contains lucene and java, and the incoming document's content is "lucene is a great API which has been developed in java", then once I have removed stopwords my query would look like: lucene and great and API and developed and java.
As the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I put only OR between the words, the results will not be accurate, as I can obtain a subscription only for java, for instance.

I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, an OR and a NOT field. This really limited the type of queries users could store. It does, however, work, particularly well on systems with /huge/ amounts of subscriptions (many millions).
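
To make the three-field idea concrete, here is a rough sketch of the matching rule only. It uses plain java.util collections rather than Lucene, and the names ThreeFieldSubscriptions, Subscription, terms and tokenize are made up for illustration. It also assumes the semantics "all AND terms present, at least one OR term present if any are given, no NOT term present", which is one reasonable reading and not necessarily what the old system did.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: plain java.util collections, no Lucene classes.
class ThreeFieldSubscriptions {

    /** Hypothetical stored subscription with three term fields. */
    static final class Subscription {
        final Set<String> andTerms; // every term must occur in the document
        final Set<String> orTerms;  // at least one must occur (ignored when empty)
        final Set<String> notTerms; // none may occur

        Subscription(Set<String> andTerms, Set<String> orTerms, Set<String> notTerms) {
            this.andTerms = andTerms;
            this.orTerms = orTerms;
            this.notTerms = notTerms;
        }
    }

    /** Builds a lower-cased term set; an empty call gives an empty set. */
    static Set<String> terms(String... words) {
        Set<String> out = new HashSet<>();
        for (String w : words) {
            out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Very naive tokenizer: lower-case, split on anything that is not a-z or 0-9. */
    static Set<String> tokenize(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Assumed rule: all AND terms, some OR term (if any given), no NOT term. */
    static boolean matches(Subscription s, Set<String> docTerms) {
        return docTerms.containsAll(s.andTerms)
            && (s.orTerms.isEmpty() || !Collections.disjoint(s.orTerms, docTerms))
            && Collections.disjoint(s.notTerms, docTerms);
    }

    public static void main(String[] args) {
        Set<String> doc = tokenize("lucene is a great API which has been developed in java");
        Subscription luceneAndJava = new Subscription(terms("lucene", "java"), terms(), terms());
        Subscription javaNotLucene = new Subscription(terms("java"), terms(), terms("lucene"));
        System.out.println(matches(luceneAndJava, doc)); // true
        System.out.println(matches(javaNotLucene, doc)); // false
    }
}
```

Because the check runs from the subscription's terms towards the document, extra words in the document do no harm. The limitation is also visible: a subscription such as (WORD1 and WORD2) or WORD3 does not fit into a single set of three flat fields.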

Today I would have used something else. If you insert one document at a time into your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25 document index on my laptop. See <https://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.
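
The reversed flow (tiny index holding the new document, stored queries run against it) could be sketched roughly as below. This is a stand-in written with java.util only, not the MemoryIndex or LUCENE-550 API: the term set plays the role of the one-document index, and the Query interface with its term/and/or/not helpers is hypothetical. With Lucene you would parse each subscription into a real Query and search the in-memory index instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: a term set stands in for the one-document index.
class ReverseSearchSketch {

    /** Hypothetical stored query, evaluated against one document's terms. */
    interface Query {
        boolean matches(Set<String> docTerms);
    }

    static Query term(String t) {
        String needle = t.toLowerCase(Locale.ROOT);
        return docTerms -> docTerms.contains(needle);
    }

    static Query and(Query... qs) {
        return docTerms -> Arrays.stream(qs).allMatch(q -> q.matches(docTerms));
    }

    static Query or(Query... qs) {
        return docTerms -> Arrays.stream(qs).anyMatch(q -> q.matches(docTerms));
    }

    static Query not(Query q) {
        return docTerms -> !q.matches(docTerms);
    }

    /** Very naive tokenizer: lower-case, split on anything that is not a-z or 0-9. */
    static Set<String> tokenize(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** "Indexes" the new document once, then runs every stored query on it. */
    static List<String> subscribersFor(String documentText, Map<String, Query> subscriptions) {
        Set<String> docTerms = tokenize(documentText);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Query> e : subscriptions.entrySet()) {
            if (e.getValue().matches(docTerms)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Query> subs = new LinkedHashMap<>();
        subs.put("alice", and(term("lucene"), term("java")));
        subs.put("bob", or(and(term("word1"), term("word2")), term("word3")));
        String doc = "lucene is a great API which has been developed in java";
        System.out.println(subscribersFor(doc, subs)); // prints [alice]
    }
}
```

Since the queries are kept as queries, nested forms such as (WORD1 and WORD2) or WORD3 need no special encoding. The cost is that every stored query is evaluated for every incoming document, which is why the speed of the small index matters.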


--
karl

