My first impression is that you need a proper database with a separate search index on top of it, rather than running the text search through the database/SQL itself. Perhaps you could try these:

1) http://www.opensymphony.com/compass/content/about.html
2) http://kasparov.skife.org/blog/2004/09/11/#lucene-ojb
3) http://www.dbsight.net/
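
To make the "database plus a separate search index" idea a bit more concrete, here is a rough sketch in plain Python. It does not use Lucene or any of the tools linked above. It assumes the rows have already been read out of the database into dicts and link lists, and the names `build_gene_documents` and `build_inverted_index` are hypothetical. Each gene becomes one document that carries a copy of the text of every article and disease linked to it, which is the same duplication your hierarchical model uses.

```python
from collections import defaultdict

# Sample rows taken from the example in the question (article titles shortened).
genes = {1: ("EGFR", "epidermal growth factor receptor"),
         2: ("AHCY", "S-adenosylhomocysteine hydrolase")}
articles = {1: "EGFR mutations in lung cancer",
            2: "Proteomics analysis of epidermal protein kinases",
            3: "Limited proteolysis of S-adenosylhomocysteine hydrolase"}
diseases = {1: "Epidermal sluffing"}
gene_article = [(1, 1), (1, 2), (2, 3), (2, 2)]  # M:M link rows: (gene_id, article_id)
gene_disease = [(1, 1)]                          # M:M link rows: (gene_id, disease_id)


def build_gene_documents():
    """Return {gene_id: [(source_label, text), ...]}, one document per gene."""
    docs = defaultdict(list)
    for gid, (symbol, description) in genes.items():
        docs[gid].append(("Symbol", symbol))
        docs[gid].append(("Description", description))
    # Related text is copied into every gene it is linked to.
    for gid, aid in gene_article:
        docs[gid].append((f"Article ID#{aid}, in Title", articles[aid]))
    for gid, did in gene_disease:
        docs[gid].append((f"Disease ID#{did}, in Name", diseases[did]))
    return dict(docs)


def build_inverted_index(docs):
    """Map each lower-cased, whitespace-separated token to the gene IDs containing it."""
    index = defaultdict(set)
    for gid, entries in docs.items():
        for _label, text in entries:
            for token in text.lower().split():
                index[token].add(gid)
    return index


docs = build_gene_documents()
index = build_inverted_index(docs)
print(sorted(index["epidermal"]))  # gene IDs whose document mentions "epidermal"
```

With the sample rows, both gene IDs come back for "epidermal": gene 1 through its description, article 2 and disease 1, and gene 2 only through the shared article 2. The whitespace tokenizer is deliberately naive; a real search library would do the analysis properly.
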
Please let us know if you find any other useful information in your search.

- SJ

On Jan 24, 2008 5:59 PM, yarongolan <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> (Warning, not for the weak-hearted)
>
> I'm currently working on a project where we have a large and complex data
> model, related to Genomics. We are trying to build a search engine that
> provides "full text" and "field-based text" searches for our customer base
> (mostly academic research), and are evaluating different tools for this
> purpose.
>
> As a starting point, we have, as an example, a set of objects (stored in
> tables as a relational model):
> Gene [ID, Symbol, Description]
> Article - M:M with Gene [ID, Title]
> Disease - M:M with Gene [ID, Name]
> Author - M:M with Article [ID, Name]
> (Note: M:M tables exist, just link IDs)
>
> An example model would be (hierarchical, relations dealt with as
> duplications)
>
> Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
>   Article [ID=1, Title=EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=2, Name=J. Watson]
>   Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=3, Name=M. Roberts]
>   Disease [ID=1, Name=Epidermal sluffing]
>
> Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
>   Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine hydrolase: implications for the three-dimensional structure]
>     Author [ID=4, Name=B. Cohen]
>     Author [ID=5, Name=L. Alexander]
>   Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=3, Name=M. Roberts]
>
> Note IDs in the objects above, as they relay the relations in the
> hierarchical model.
>
> In our Full-Text search, we would like to allow users to search ANY textual
> field for any string. For instance, the term "epidermal", and display the
> list of genes which have any data associated with them with that term
> (ranked, of course).
> Our list of results would be something like:
>
> EGFR
>   Found in Description (epidermal growth factor receptor)
>   Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)
>   Found in Disease ID#1, in Name (Epidermal sluffing)
>
> AHCY
>   Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)
>
> Note that the results retain a hierarchical view of our Genes (us being
> Gene-Centric, we're pretty much framing the question "find this term in
> information related to those genes"). Also note that Article ID #2 has an
> M:M with Gene ID2 (AHCY) and Gene ID1 (EGFR), and only due to that fact,
> AHCY is considered a gene that has "epidermal" in its annotations.
>
> Obviously, we'd like to rank fields by location in hierarchy (A term in a
> gene name is scored higher than the name of the author of an article related
> to a gene) and by number of hits (number of times a term is found related to
> that gene, 3 in the case of EGFR above).
>
> Ideas for how to take on this challenge? Implementation? Tools?
>
> Thanks!
> Yaron Golan
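
One possible reading of the ranking requirement above is to give each level of the hierarchy a weight and add up the weights of all matching fields per gene, so that both the position in the hierarchy and the number of hits contribute to the score. The sketch below is only an illustration in plain Python: the weight values, the `FIELD_WEIGHTS` table and the `search` function are made up for this example, and a real engine such as Lucene tokenizes and scores differently.

```python
# Hypothetical weights: a higher value means the field is closer to the gene itself.
FIELD_WEIGHTS = {"gene": 3.0, "disease": 2.0, "article": 1.0}

# Flattened (gene symbol, hierarchy level, label, text) entries from the example above.
ENTRIES = [
    ("EGFR", "gene", "Symbol", "EGFR"),
    ("EGFR", "gene", "Description", "epidermal growth factor receptor"),
    ("EGFR", "article", "Article ID#1, in Title",
     "EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy"),
    ("EGFR", "article", "Article ID#2, in Title",
     "Proteomics analysis of epidermal protein kinases by target class-selective "
     "prefractionation and tandem mass spectrometry"),
    ("EGFR", "disease", "Disease ID#1, in Name", "Epidermal sluffing"),
    ("AHCY", "gene", "Symbol", "AHCY"),
    ("AHCY", "gene", "Description", "S-adenosylhomocysteine hydrolase"),
    ("AHCY", "article", "Article ID#3, in Title",
     "Limited proteolysis of S-adenosylhomocysteine hydrolase: implications for the "
     "three-dimensional structure"),
    ("AHCY", "article", "Article ID#2, in Title",
     "Proteomics analysis of epidermal protein kinases by target class-selective "
     "prefractionation and tandem mass spectrometry"),
]


def search(term):
    """Return [(symbol, score, [labels])], best-scoring gene first."""
    term = term.lower()
    hits = {}
    for symbol, level, label, text in ENTRIES:
        # Naive whitespace tokenization; punctuation stays attached to tokens.
        if term in text.lower().split():
            score, labels = hits.get(symbol, (0.0, []))
            hits[symbol] = (score + FIELD_WEIGHTS[level], labels + [label])
    ranked = [(symbol, score, labels) for symbol, (score, labels) in hits.items()]
    return sorted(ranked, key=lambda row: -row[1])


for symbol, score, labels in search("epidermal"):
    print(symbol, score)
    for label in labels:
        print("  Found in", label)
```

With these made-up weights, "epidermal" ranks EGFR (three hits: description, article 2, disease 1) ahead of AHCY (one hit through the shared article 2), which matches the result list you described. Author names could be added as a further level with a lower weight.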
