On 23 Mar 2007, at 02:12, Melanie Langlois wrote:
I want to manage user subscriptions to specific documents. So I
would like to store the subscription (query) into the lucene
directory, and whenever I receive a new document, I will search all
the matching subscriptions to send the documents to all subscribers.
For instance if a user subscribes to all documents with text
containing (WORD1 and WORD2) or WORD3, how can I match the incoming
document based on stored subscriptions? I was thinking to have two
subfields for each field of the subscription: the AND conditions
and the OR conditions.
- OR: I will tokenize the document field content, insert OR
between the tokens, and run that query against the OR conditions
of the subscriptions.
- AND: this is where I will have an issue, because the incoming
text may contain more words than the sequence I want to search
for.
For instance, say I subscribe to documents containing lucene and
java. If the incoming document content is "lucene is a great API
which has been developed in java", then once I have removed
stopwords my query would look like lucene and great and API and
developed and java.
As the query is composed of more words than the stored subscription,
I will fail to retrieve the subscription. But if I put only OR
between the words, the results will not be accurate, as a
subscription could be retrieved on java alone, for instance.

I wrote such a thing way back, where I used the new document as the
query and the user subscriptions as the index. Similar to what you
describe, I had an AND, OR and NOT field. This really limited the
type of queries users could store. It does however work, particularly
well on systems with /huge/ amounts of subscriptions (many millions).
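
To make the AND, OR and NOT field idea concrete, here is a minimal
sketch in plain Java. It makes no Lucene calls and is not the code
described above; the class FieldSubscription and its methods are
hypothetical names used only for illustration. It shows why extra
words in the incoming document need not break the AND condition: the
test is whether the AND terms are a subset of the document's terms,
not whether the two sets are equal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch in plain Java (no Lucene calls).
class FieldSubscription {
    final Set<String> andTerms; // every one must occur in the document
    final Set<String> orTerms;  // at least one must occur, unless the set is empty
    final Set<String> notTerms; // none may occur

    FieldSubscription(Set<String> andTerms, Set<String> orTerms, Set<String> notTerms) {
        this.andTerms = andTerms;
        this.orTerms = orTerms;
        this.notTerms = notTerms;
    }

    // Lower-cases the text and splits it on runs of non-word characters.
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    boolean matches(Set<String> docTerms) {
        if (!docTerms.containsAll(andTerms)) return false;             // AND: subset test
        if (!Collections.disjoint(docTerms, notTerms)) return false;   // NOT: no overlap allowed
        return orTerms.isEmpty() || !Collections.disjoint(docTerms, orTerms); // OR: some overlap
    }

    public static void main(String[] args) {
        Set<String> none = Collections.<String>emptySet();
        FieldSubscription sub = new FieldSubscription(terms("lucene java"), none, none);
        System.out.println(sub.matches(
            terms("lucene is a great API which has been developed in java")));
        System.out.println(sub.matches(terms("java is a language")));
    }
}
```

A subscription such as (WORD1 and WORD2) or WORD3 cannot be stored
in this three-set form without splitting it into two subscriptions,
which is the kind of limitation mentioned above.
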
Today I would use something else. If you insert one document at a
time into your index, take a look at MemoryIndex in contrib. If you
insert documents in batches of more than one document at a time,
take a look at LUCENE-550 in the Jira. Add the new documents to such
an index and run the subscribed queries against it. Depending on the
queries, the speed should be some 20-100 times faster than using a
RAMDirectory. One million queries should take some 20 seconds to
assemble and run against a 25 document index on my laptop. See
<https://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg>
for the performance of LUCENE-550.
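
The reversed arrangement (a tiny index holding only the new document,
with every stored subscription run against it) can be sketched
without Lucene as follows. This is only an illustration of the idea,
under the assumption that a set of lower-cased tokens is an adequate
stand-in for the one-document index. DocumentMatcher, Query and the
helper methods are hypothetical names and not part of MemoryIndex or
LUCENE-550, which would take over the indexing and query evaluation
in a real setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch in plain Java (no Lucene calls).
class DocumentMatcher {
    // A stored subscription: any boolean combination of terms.
    interface Query {
        boolean matches(Set<String> docTerms);
    }

    static Query term(String t) { return d -> d.contains(t); }
    static Query and(Query... qs) { return d -> Arrays.stream(qs).allMatch(q -> q.matches(d)); }
    static Query or(Query... qs) { return d -> Arrays.stream(qs).anyMatch(q -> q.matches(d)); }
    static Query not(Query q) { return d -> !q.matches(d); }

    // Stand-in for indexing a single document: its set of lower-cased tokens.
    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Runs every stored subscription against the one new document and
    // returns the subscribers whose query matched, in map iteration order.
    static List<String> matchingSubscribers(Map<String, Query> subscriptions, String text) {
        Set<String> docTerms = tokenize(text);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Query> e : subscriptions.entrySet()) {
            if (e.getValue().matches(docTerms)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

With this shape a subscription such as (WORD1 and WORD2) or WORD3 is
stored as or(and(term("word1"), term("word2")), term("word3")), and
extra words in the incoming document do not affect the result.
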
--
karl