Let me describe what the goal is and how I could use Verity to accomplish
this -- provided that Verity did not impose the limits it does.

The documents being indexed are small, completely unstructured, textual
descriptions of adverse events involving one or more drugs, one or more
medical conditions, and potentially other relevant and irrelevant
information.  By "small" I mean that they are typically on the order of
several sentences in length.

Assume that the initial goal is to identify pairwise associations of drugs
and conditions in such documents.  Moreover, we would want not only to
identify drug/condition pairs, but more broadly to identify
type-of-drug/type-of-condition pairs.  For example, the set of adverse
event reports might not contain a significant number of reports about a
*specific* drug and a *specific* condition -- such as (just as an example)
'aspirin' and 'blood pressure' -- but it might contain a significant number
of reports about a particular *class* of drugs (therapeutic class or
pharmacological class) and a particular *class* of conditions -- such as
'beta-blockers' and 'cardiac events'.

Viewed as an information retrieval problem (not the best way to view it,
but this is just an initial approach), one could then (1) create a taxonomy
of drugs and a taxonomy of conditions, and (2) implement a concept-oriented
(taxonomy-oriented) search of the corpus for something like:
                    {beta_blocker} AND {cardiac_condition}
where '{beta_blocker}' expands via the taxonomy to the set of terms (words,
sequences of words, etc.) that "fall under" that "concept" in the drug
taxonomy and similarly for '{cardiac_condition}' under the condition
taxonomy.
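
For concreteness, here is a rough sketch of how I picture such a concept
expansion being assembled as a Lucene query.  The term lists and the field
name 'text' are hypothetical stand-ins for what the taxonomies would supply,
and I am assuming the 1.x-style BooleanQuery.add(Query, required, prohibited)
signature; treat it as an illustration rather than code I have actually run
against Lucene:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ConceptQuery {

    // OR together every surface term that falls under one concept in the
    // taxonomy; 'field' is whatever field the report text was indexed into.
    static BooleanQuery expandConcept(String field, List terms) {
        BooleanQuery concept = new BooleanQuery();
        for (Iterator it = terms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            // required=false, prohibited=false: an OR-style clause
            concept.add(new TermQuery(new Term(field, term)), false, false);
        }
        return concept;
    }

    public static void main(String[] args) {
        // Stand-ins for what the drug and condition taxonomies would supply.
        List betaBlockers = Arrays.asList(
            new String[] { "atenolol", "propranolol", "metoprolol" });
        List cardiacConditions = Arrays.asList(
            new String[] { "infarction", "thrombosis", "arrhythmia" });

        // {beta_blocker} AND {cardiac_condition}: both disjunctions required.
        BooleanQuery query = new BooleanQuery();
        query.add(expandConcept("text", betaBlockers), true, false);
        query.add(expandConcept("text", cardiacConditions), true, false);
        System.out.println(query.toString("text"));
    }
}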

A good search engine would then return, for any document in which the query
matches, the exact string(s) matching the query (e.g., 'thrombosis' or
'infarction' in the case of '{cardiac_condition}').  That is, from a very
general query (phrased in terms of 'concepts' or 'categories'), you would
get returned associations of specific terms and phrases.  Actually, Verity
does this pretty nicely once you transform your taxonomy into a Verity
topic set.
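
In Lucene terms I imagine the equivalent of "return the matching strings"
could be approximated by a post-processing step over each hit: check which
of the concept's surface terms actually occur in the document text.  Purely
illustrative (the names are made up, and the matching here is crude substring
matching rather than the engine's own tokenization):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MatchedTerms {

    // Given the text of a returned report and the list of surface terms that
    // fall under the queried concept, report which terms actually occur.
    static List termsFound(String documentText, List conceptTerms) {
        String lowered = documentText.toLowerCase();
        List found = new ArrayList();
        for (Iterator it = conceptTerms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            if (lowered.indexOf(term.toLowerCase()) >= 0) {
                found.add(term);
            }
        }
        return found;
    }
}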

You can then take these specific associations you have identified
(retrieved) and see what generalizations they fall under from the point of
view of the taxonomy -- hoping to identify associations between classes of
drugs and classes of conditions.  (How you do this, I omit here.)

Ideally, your initial search should then simply be the most general one
possible -- say of the form:

                    {drug} AND {condition}

(actually, probably not quite this simple; but you get the idea).  The
problem is that '{drug}' will expand into (logically) a disjunctive term of
60,000 subterms, and '{condition}' will likewise expand into a disjunctive
term of multiple thousands of terms.  Something logically equivalent to:

                  (drug_1 OR drug_2 OR ... OR drug_60000)
                    AND
                  (condition_1 OR condition_2 OR ... OR condition_5000)

Verity's implementation of their query constructor (it generates a machine
to do the matching) imposes a limit of 1,024 child nodes of any single
disjunctive node (roughly speaking) and a collective limit of (16,000/3)
nodes for a topic.  Prior to hitting the limit, Verity does just swell.

So, with that much more context, the question can be rephrased: is there any
problem with Lucene handling queries such as the one above, where the
disjunctive sub-queries contain that many terms?
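
From a quick look at the Lucene javadocs (and please correct me if I am
misreading them or they have since changed), it appears that BooleanQuery
imposes a similar limit -- a TooManyClauses exception beyond a default of
1,024 clauses -- but, unlike Verity's, the limit looks adjustable.  A minimal
sketch, assuming those static methods are present in the release at hand:

import org.apache.lucene.search.BooleanQuery;

public class ClauseLimit {
    public static void main(String[] args) {
        // The default maximum is (as I read the javadocs) 1,024 clauses per
        // BooleanQuery; exceeding it raises BooleanQuery.TooManyClauses.
        System.out.println("default max clauses: "
            + BooleanQuery.getMaxClauseCount());

        // Raise it far enough to hold the full '{drug}' expansion
        // (~60,000 OR'd terms) with some headroom.
        BooleanQuery.setMaxClauseCount(100000);
    }
}

Even if that removes the hard limit, the practical question remains of how
such a query performs, in memory and time, over a large corpus.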

You can see, I think, that this has nothing to do with categorization (at
least in the usual sense).  It is, in fact, an attempt to use a search
engine to accomplish information extraction.  I was hoping to do this in
order to get some quick and relatively easy results -- and I could if
Verity didn't have these scalability problems.  The one suggestion I have
seen so far in the responses that seems relevant to the problem is the
suggestion to transform the taxonomy (taxonomies) into an Analyzer.
Without looking at the implementation of Lucene I can't say how easy or
successful that would be.  Certainly it would be possible to transform any
such taxonomy into an FSA representation that could serve as what I
understand an Analyzer to be.  But I was hoping that perhaps that was what
Lucene already did in building a query!  If I have to generate my own FSAs
from my own queries, and implement such features as stemming and the like
as part of the process, it's not obvious that I shouldn't just do it from
scratch -- in which case I would most likely do it in C (or Python+C)
rather than Java, with an eye toward increased performance.
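
To make the Analyzer suggestion concrete (at least as I understand it), here
is the sort of thing I imagine: a TokenFilter that, at indexing time, maps
surface terms onto taxonomy concepts, so that a later query for
'beta_blocker' matches reports that mention 'atenolol'.  The conceptOf table
is a hypothetical stand-in for the taxonomy, it handles only single-word
terms, and it assumes the 1.x-style TokenFilter API (a TokenFilter(TokenStream)
constructor and a next() method returning Token):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Replaces any token found in the taxonomy with its concept label, so that
// a query for the concept matches documents mentioning the specific term.
public class ConceptTokenFilter extends TokenFilter {

    private final Map conceptOf;  // hypothetical: "atenolol" -> "beta_blocker"

    public ConceptTokenFilter(TokenStream input, Map conceptOf) {
        super(input);
        this.conceptOf = conceptOf;
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) {
            return null;          // end of stream
        }
        String concept = (String) conceptOf.get(token.termText());
        if (concept == null) {
            return token;         // not in the taxonomy; pass through
        }
        // Keep the original offsets so the matched surface string can still
        // be recovered from the source text.
        return new Token(concept, token.startOffset(), token.endOffset());
    }
}

A real version would also have to emit the original token alongside the
concept token (so the specific term is not lost), cope with multi-word terms
like 'blood pressure', and fold in stemming -- which is precisely the work I
was hoping not to have to do myself.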

Any additional insight will be appreciated.

(If you are interested and can get a copy of it, there is an article
describing various features of this project:  "The Babylon Project:  Toward
an Extensible Text-Mining Platform".
http://computer.org/itpro/it2003/f2toc.htm)

--------------------------------------
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456



