RE: Reverse search

Melanie Langlois Thu, 22 Mar 2007 18:08:06 -0800

Thanks Karl, the performances graph is really amazing!
I have to say that it would not have think this way around would be faster, but 
sounds nice if I can use this, make everything easier to manage. I'm just 
wondering what did you consider when you build your graph, only the time to run 
the queries? Because, I should add the time for creating the index anytime a 
new document comes in (or a subset of documents if several comes in same time), 
and the indexing of these documents. The documents should not be big, around 
2KB. Did you measure this part ?

Mélanie 

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 10:35 AM
To: java-user@lucene.apache.org
Subject: Re: Reverse search

23 mar 2007 kl. 02.12 skrev Melanie Langlois:

> I want to manage user subscriptions to specific documents. So I  
> would like to store the subscription (query) into the lucene  
> directory, and whenever I receive a new document, I will search all  
> the matching subscriptions to send the documents to all subcribers.  
> For instance if a user subscribes to all documents with text  
> containing (WORD1 and WORD2) or WORD3, how can I match the incoming  
> document based on stored subscriptions? I was thinking to have two  
> subfields for each field of the subscription: the AND conditions  
> and the OR conditions.
>
> -OR. I will tokenized the document field content and insert OR  
> between each of them, and run the query against OR condition of  
> subscription
>
> -It's for the AND that I will have an issue, because if the  
> incoming text may contains more words than the sequence I want to  
> search.
>
> For instance, if I subscribe for documents contents lucene and java  
> for instance , if the incoming document contents is lucene is a  
> great API which has been developed in java, once I removed  
> stopwords my query would look like lucene and great and API and  
> developed and java.
>
> As query is composed of more words than the stored subscription I  
> will fail to retrieve the subscription. But if I put only or words,  
> the results will not be accurate, as I can obtain subscription only  
> for java for instance.
>

I wrote such a thing way back, where I used the new document as the  
query and the user subscriptions as the index. Similar to what you  
describe, I had an AND, OR and NOT field. This really limited the  
type of queries users could store. It does however work, particullary  
well on systems with /huge/ amounts of subscriptions (many millions).

Today I would have used something else. If you insert one document at  
the time to your index, take a look at MemoryIndex in contrib. If you  
insert documents in batches larger than one document at the time,  
take a look at LUCENE-550 in the Jira. Add new documents to such an  
index and place the subscribed queries on it. Depening on the  
queries, the speed should be some 20-100 times faster than using a  
RAMDirectory. One million queries should take some 20 seconds to  
assemble and place on a 25 document index on my laptop. See <https:// 
issues.apache.org/jira/secure/attachment/ 
12353601/12353601_HitCollectionBench.jpg> for performace of LUCENE-550.

-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Reverse search

Reply via email to