Thanks Karl, the performances graph is really amazing! I have to say that it would not have think this way around would be faster, but sounds nice if I can use this, make everything easier to manage. I'm just wondering what did you consider when you build your graph, only the time to run the queries? Because, I should add the time for creating the index anytime a new document comes in (or a subset of documents if several comes in same time), and the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part ?
Mélanie -----Original Message----- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 10:35 AM To: java-user@lucene.apache.org Subject: Re: Reverse search 23 mar 2007 kl. 02.12 skrev Melanie Langlois: > I want to manage user subscriptions to specific documents. So I > would like to store the subscription (query) into the lucene > directory, and whenever I receive a new document, I will search all > the matching subscriptions to send the documents to all subcribers. > For instance if a user subscribes to all documents with text > containing (WORD1 and WORD2) or WORD3, how can I match the incoming > document based on stored subscriptions? I was thinking to have two > subfields for each field of the subscription: the AND conditions > and the OR conditions. > > -OR. I will tokenized the document field content and insert OR > between each of them, and run the query against OR condition of > subscription > > -It's for the AND that I will have an issue, because if the > incoming text may contains more words than the sequence I want to > search. > > For instance, if I subscribe for documents contents lucene and java > for instance , if the incoming document contents is lucene is a > great API which has been developed in java, once I removed > stopwords my query would look like lucene and great and API and > developed and java. > > As query is composed of more words than the stored subscription I > will fail to retrieve the subscription. But if I put only or words, > the results will not be accurate, as I can obtain subscription only > for java for instance. > I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. It does however work, particullary well on systems with /huge/ amounts of subscriptions (many millions). Today I would have used something else. If you insert one document at the time to your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at the time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depening on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25 document index on my laptop. See <https:// issues.apache.org/jira/secure/attachment/ 12353601/12353601_HitCollectionBench.jpg> for performace of LUCENE-550. -- karl --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]