Gary,

I don't fully understand how you use your drug thesauri, but my approach would be to plug your thesauri into an Analyzer. This would let you, during analysis, coerce the various terms to single meanings, somewhat akin to how a stemmer works.
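
A rough sketch of the idea, with hypothetical class names (ThesaurusFilter,
ThesaurusAnalyzer) and written against a recent Lucene API rather than the
version discussed here. It only handles single-token thesaurus entries, so
multi-word drug names would need extra handling:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rewrites each token to its preferred (canonical) term when the thesaurus knows it.
final class ThesaurusFilter extends TokenFilter {
    private final Map<String, String> canonical; // variant term -> preferred term
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    ThesaurusFilter(TokenStream in, Map<String, String> canonical) {
        super(in);
        this.canonical = canonical;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String preferred = canonical.get(termAtt.toString());
        if (preferred != null) {
            termAtt.setEmpty().append(preferred); // coerce the variant to one meaning
        }
        return true;
    }
}

// An Analyzer that tokenizes normally, then applies the thesaurus mapping.
final class ThesaurusAnalyzer extends Analyzer {
    private final Map<String, String> canonical;

    ThesaurusAnalyzer(Map<String, String> canonical) {
        this.canonical = canonical;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream result = new ThesaurusFilter(tokenizer, canonical);
        return new TokenStreamComponents(tokenizer, result);
    }
}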

As for size, we're currently using Lucene to index about 100 megs of data, and lookup performance is blindingly fast. Indexing takes a while, but that's as much because of how we calculate the 20+ fields we index for each Document.
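
For reference, a minimal multi-field indexing sketch with a recent Lucene
API (the field names and values are made up for illustration):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ReportIndexer {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // Each report becomes one Document carrying many computed fields.
            Document doc = new Document();
            doc.add(new StringField("reportId", "AE-2002-0001", Field.Store.YES));
            doc.add(new TextField("drugName", "aspirin", Field.Store.YES));
            doc.add(new TextField("narrative", "patient reported nausea ...", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}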

Can you give more specifics on the type of data you'd index, the query you'd want to run, and the desired result of the query?

--
Serge Knystautas
President
Lokitech >> software . strategy . design >> http://www.lokitech.com
p. 301.656.5501
e. [EMAIL PROTECTED]

[EMAIL PROTECTED] wrote:
Caveat:  I have not yet installed Lucene or begun to experiment with it.
I have scanned the FAQ, but don't see anything that addresses this
question.  Pardon the somewhat slow buildup to the question below, but I
want to set the context.

I am developing an application for 'text mining' adverse event reports in
the pharmaceutical industry.  The querying will be driven by
'dictionaries', 'thesauri',  'taxonomies' or 'ontologies' (pick your
favorite) of drug names, compounds, and medical conditions.  These thesauri
are quite large.  For example, our drug name thesaurus is on the order of
60,000+ terms.

I was planning on using Verity to accomplish my first approach to shallow
text mining since Verity is our corporate-wide search engine technology and
it supports a number of relevant features (including 'topic sets' for
representing the taxonomies).  However, Verity imposes restrictions on the
size of topic sets that currently prohibit me from using it with our large
taxonomies.  It is not obvious that they will be able to fix this problem
in the timeframe I need.  Thus I am turning to other alternatives, and
Lucene appears to be one.

So given that context, my question is this:  Does anyone on this list have
experience attempting to use very large queries (potentially thousands or
tens of thousands of terms) in Lucene?  Does anyone have any knowledge of
design or implementation details that would inhibit the use of such
queries?  Does anyone have any idea of what the performance would be like
in retrieving via such queries?
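
(For concreteness, a minimal sketch of what such a query could look like
with a recent Lucene API; the "drugName" field and the term list are
assumptions, and Lucene's default limit of 1024 boolean clauses would have
to be raised for queries of this size:)

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

final class ThesaurusQueryBuilder {
    // Every thesaurus term becomes one optional clause of a single BooleanQuery.
    static Query build(List<String> drugTerms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : drugTerms) {
            builder.add(new TermQuery(new Term("drugName", term)), BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}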

