Somnath Banerjee wrote:
Thanks for the reply. Good guess I think.
DB (Index) is basically a collection of encyclopedia documents. Queries are
also a collection of documents but of various domains. My task is to find
out for each "query document" top 100 matching encyclopedia contents.
I tried by using only the title of (5-8 words) the query documents instead
of full text of the document. But that is also taking 0.5-1 sec for each
query. That's mean it will also take nearly 6 and half days to run
0.72Mqueries (and expectedly the precision will suffer).
Thank you, the problem is a little clearer now. This is a big search
problem, so your biggest obstacle is running it on a tiny computer (from
what you've said, only a fraction of the *queries* fit in your RAM
budget, not to mention the database). I'm confident that spending a
little $$ on your computing resources, particularly RAM, would be FAR,
FAR more cost-effective than programming labor.
But, if you can't get a bigger computer, then you can't. In that case,
I'm not sure that Lucene is the best tool for this problem, at least not
exclusively. You might find that some preprocessing, of the
encyclopedia and query documents, could be used to quickly find the
highest-probability candidates, then use Lucene to score them. I'm
think of a merge-sort kind of algorithm between the query "documents"
and the encyclopedia documents. For each query, find the top several
hundred candidate documents from the encyclopedia, perhaps by number of
matching non-"stop words" (see below). You could also get more
sophisticated by looking at word frequencies from your Lucene index, but
I doubt it would make a huge difference for queries of hundreds of words
that are looking for hundreds of hits.
A couple more suggestions:
- Search the archives for the topic "more documents like this" and
variations on that theme. Several people have used Lucene in this way,
with varying degrees of success.
- If you haven't already, ditch "stop words", like "a", "the",
prepositions, etc. Your index will be smaller and so will your queries,
making each search faster.
Good luck!
--MDC
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]