Thanks for these suggestions. The idea of adding taxonomy-related terms to the documents is an interesting one and bears some thought. However, if I have to pre-process the corpus to determine which terms to add, and then add them, it would seem that I've already accomplished my primary goal and don't need an indexer and search engine. Remember: what is being contemplated here is not really an information retrieval application (with document-level granularity), but an information extraction and text/data mining application (with "fact-level" granularity). My hope was to leverage a search engine, guided by taxonomies, to accomplish this at least as a first cut.
I do find Morus's suggestion to do an "inverse expansion" of terms in the index at indexing time to be very intriguing as well. Perhaps it is also what was meant by Ype's suggestion about adding stuff to the document (meaning adding stuff to the index).

It appears I will also need to handle my own identification of matched terms. Verity, too, supports term highlighting -- but I am not at all certain they return information concerning the exact string that triggered the highlighted match. Perhaps if the "inverse expansion" approach can be made to work, it would eliminate this need. And it might also eliminate the need for the very large queries. The details are unclear at this point, but the possibilities are interesting.

The suggestion of Jython is also appreciated and I was considering it already. I have not used Jython yet, but have developed all of my ontology/taxonomy/dictionary/thesaurus translation tools in Python (and yes, I do know the differences among all of these). I've even started to develop some of my interface stuff in Tkinter, but if I'm going to go the Java route I'll probably abandon that in favor of Swing.

Well, I can see that I have a bit of work to do. I do have an undergraduate and a graduate student here at NC State working with me, and perhaps I can squeeze some of this work out of them :-).

--------------------------------------
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456
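P.S. To make the "inverse expansion" idea a little more concrete, here is a rough sketch of how I currently understand it, written in plain Python rather than against any real indexer. Everything in it is an assumption made for illustration: the toy PARENT taxonomy, the sample documents, and the helper names ancestors and build_index are all hypothetical, and it handles only single-word terms. Each token is indexed under itself and under every taxonomy ancestor, and each posting keeps the surface string that triggered it.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child term -> parent term), for illustration only.
PARENT = {
    "aspirin": "analgesic",
    "ibuprofen": "analgesic",
    "analgesic": "drug",
}


def ancestors(term):
    """Yield every ancestor of `term` by walking up the PARENT map."""
    seen = set()  # guards against accidental cycles in the taxonomy
    while term in PARENT and term not in seen:
        seen.add(term)
        term = PARENT[term]
        yield term


def build_index(docs):
    """Build a toy inverted index with inverse expansion.

    Each term maps to a list of (doc_id, position, surface) postings,
    where `surface` is the token that actually occurred in the text.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            posting = (doc_id, pos, token)
            index[token].append(posting)      # the token itself
            for anc in ancestors(token):      # inverse expansion step
                index[anc].append(posting)
    return index


# Hypothetical sample corpus.
DOCS = {
    1: "Patient was given aspirin daily",
    2: "Ibuprofen reduced the swelling",
    3: "No medication was recorded",
}
INDEX = build_index(DOCS)

# A single-term lookup on the broad category now finds the specific
# mentions, and each posting reports the exact string that matched.
for doc_id, pos, surface in INDEX["analgesic"]:
    print(doc_id, pos, surface)
```

If this reading is right, a query on the single term "analgesic" stands in for the very large expanded query, and the stored surface string answers the question of which exact term triggered the match. Whether a real indexer can store the extra terms at the same position as the original token is something I have not verified.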
