I don't fully understand how you use your drug thesauri, but my approach would be to incorporate your thesauri into an Analyzer. This would allow you, during analysis, to coerce the various terms to single meanings, somewhat akin to how a stemmer works.
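A rough, untested sketch of what I mean is below. It assumes your thesaurus is already loaded into a Map of variant term -> canonical term (the map and class names are just placeholders), and the exact TokenStream API may differ depending on which Lucene version you're on:

    import java.io.IOException;
    import java.io.Reader;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Maps every thesaurus variant to one canonical term, the way a stemmer collapses word forms. */
    public class ThesaurusFilter extends TokenFilter {
        private final Map canonical; // variant term -> canonical term (hypothetical lookup)

        public ThesaurusFilter(TokenStream input, Map canonical) {
            super(input);
            this.canonical = canonical;
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            String mapped = (String) canonical.get(t.termText());
            if (mapped == null) return t; // not in the thesaurus, pass through unchanged
            return new Token(mapped, t.startOffset(), t.endOffset(), t.type());
        }
    }

    /** Analyzer that lower-cases tokens and then coerces thesaurus variants to canonical terms. */
    class ThesaurusAnalyzer extends Analyzer {
        private final Map canonical;

        public ThesaurusAnalyzer(Map canonical) {
            this.canonical = canonical;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(reader);
            ts = new LowerCaseFilter(ts);
            ts = new ThesaurusFilter(ts, canonical);
            return ts;
        }
    }

You'd use the same analyzer for both indexing and query parsing so index-time and query-time terms line up. Multi-word drug names would need extra handling (buffering tokens, say), but single-token synonyms collapse cleanly this way.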
As for size, we're currently using Lucene to index about 100 megs of data, and lookup performance is blindingly fast. Indexing takes a while, but that's as much because of how we calculate the 20+ fields we're indexing on for each Document.
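If it helps to picture it, a stripped-down version of that kind of indexing loop might look like the untested sketch below (the field names and path are invented for illustration, and the Field factory methods are from the classic API, so they may differ in your version):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ReportIndexer {
        // Index one record as a Document with several fields; real indexes often carry many more.
        public static void indexReport(IndexWriter writer, String reportId,
                                       String narrative, String drugNames) throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("reportId", reportId));    // stored as-is, not tokenized
            doc.add(Field.Text("narrative", narrative));     // stored and tokenized
            doc.add(Field.UnStored("drugNames", drugNames)); // tokenized but not stored
            // ... add the rest of your computed fields here
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws IOException {
            IndexWriter writer = new IndexWriter("/tmp/reports-index", new StandardAnalyzer(), true);
            indexReport(writer, "AE-0001", "patient reported severe headache after ...", "aspirin");
            writer.optimize();
            writer.close();
        }
    }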
Can you give more specifics on the type of data you'd index, the query you'd want to run, and the desired result of the query?
--
Serge Knystautas
President
Lokitech >> software . strategy . design >> http://www.lokitech.com
p. 301.656.5501
e. [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
Caveat: I have not yet installed Lucene or begun to experiment with it. I have scanned the FAQ, but don't see anything that addresses this question. Pardon the somewhat slow buildup to the question below, but I want to set the context.
I am developing an application for 'text mining' adverse event reports in the pharmaceutical industry. The querying will be driven by 'dictionaries', 'thesauri', 'taxonomies' or 'ontologies' (pick your favorite) of drug names, compounds, and medical conditions. These thesauri are quite large. For example, our drug name thesaurus is on the order of 60,000+ terms.
I was planning on using Verity to accomplish my first approach to shallow text mining, since Verity is our corporate-wide search engine technology and it supports a number of relevant features (including 'topic sets' for representing the taxonomies). However, Verity imposes restrictions on the size of topic sets that currently prohibit me from using it with our large taxonomies. It is not obvious that they will be able to fix this problem in the timeframe I need. Thus I am turning to other alternatives, and Lucene appears to be one.
So given that context, my question is this: Does anyone on this list have experience attempting to use very large queries (potentially thousands or tens of thousands of terms) in Lucene? Does anyone have any knowledge of design or implementation details that would inhibit the use of such queries? Does anyone have any idea of what the performance would be like in retrieving via such queries?
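To make the question concrete, I would expect to build such a query by OR-ing every thesaurus term into a single BooleanQuery, roughly like the sketch below (untested, written from my reading of the API docs; the field name and term list are placeholders, and I don't know whether Lucene imposes a limit on the number of clauses):

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ThesaurusQueryBuilder {
        // OR together every thesaurus term against a single field; with our drug
        // thesaurus this would produce 60,000+ clauses.
        public static Query buildQuery(String field, List thesaurusTerms) {
            BooleanQuery query = new BooleanQuery();
            for (Iterator it = thesaurusTerms.iterator(); it.hasNext();) {
                String term = (String) it.next();
                // required = false, prohibited = false  ->  an optional (OR) clause
                query.add(new TermQuery(new Term(field, term)), false, false);
            }
            return query;
        }

        public static void main(String[] args) {
            List terms = Arrays.asList(new String[] { "aspirin", "acetylsalicylic acid", "asa" });
            System.out.println(buildQuery("narrative", terms));
        }
    }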
