Hi,

We do a lot of protein database searches, looking for distant homologs. If we send you protein sequences, can you search a protein database (NR) with them?
Chris

On Feb 11, 2008, at 3:56 PM, Theodore H. Smith wrote:

>
> On 11 Feb 2008, at 22:28, Ryan Golhar wrote:
>
>> Why don't you write up a paper describing the algorithm in detail and
>> submit it to a bioinformatics journal? And why not make the executable
>> available with documentation, so that people can download it and try
>> it out for themselves?
>>
>> Do you have any test cases that show it runs faster/better than BLAST?
>> Describe them and make them available.
>
> The first thing I'd need to do is make a good test. I'm not sure what
> constitutes "a good test" in this case.
>
> How big should the databanks be to make the test reasonable? Is
> randomly generated data good enough, or is a randomly selected sample
> better? If a sample is better, how large a dataset must I gather to do
> the test?
>
> Perhaps certain settings make my algorithm work better or worse
> relative to BLAST. But then how do I know which settings are more
> likely to be used and which aren't?
>
> I think someone who uses BLAST frequently, and knows it well from a
> user's perspective, might have a better feel for creating a test
> than I do.
>
> The worst thing that could happen is that I make a test which is
> unfairly biased towards my algorithm :) The next thing that would
> happen is that people would see my test has "suspiciously good"
> results, be annoyed about that, and lose interest, even if it were an
> innocent mistake on my end. I'd rather avoid that sort of mistake by
> getting a knowledgeable eye on the design of the test!
>
> Like I said, I haven't gotten all the code into C++ yet. I've got a
> framework in C++ already; I mean, I know how to write C++. And I know
> what to do, as I've written it in a prototyping language.
>
> The C++ version will come soon, though.
>
>> Theodore H. Smith wrote:
>>> Hi everyone,
>>>
>>> So I've been working, on and off, on this algorithm for quite a while
>>> now. It's basically an invention of mine.
>>> It is a "blast-like" algorithm, in that it does "fuzzy lookup"
>>> operations across a database of letters. I am designing this
>>> algorithm to be useful for bio-informatics; this is the main field I
>>> am initially targeting.
>>>
>>> The database will be filled with protein sequences, and the query
>>> searched across the database will be another protein sequence. The
>>> algorithm has a "scoring matrix", which can accept different protein
>>> replacement scores. The cost of inserting letters (protein letters)
>>> can be configured also.
>>>
>>> In this sense, it's no different to Smith-Waterman. The same input,
>>> the same output!
>>>
>>> The real difference from Smith-Waterman is its speed. My algorithm
>>> will be hugely faster. This is because I use many techniques to avoid
>>> processing unnecessary parts of the Smith-Waterman matrix.
>>>
>>> I also use many tricks to reuse computations across various proteins.
>>> For example, the matrix for protein "ABCDE" is identical, at first
>>> anyhow, to the matrix for "ABCDEFG". This means that if I have both
>>> proteins "ABCDE" and "ABCDEFG" in my protein database, I can test
>>> both of them against the search query in almost half the time. My
>>> algorithm also runs in logarithmic time with respect to the size of
>>> the database. Basically, bigger databases run disproportionately
>>> faster.
>>>
>>> I want to turn this algorithm into something useful for people. My
>>> first challenge here is to answer the question "is this algorithm
>>> faster, or better, than BLAST?". If it is not faster, my algorithm
>>> basically has little use. But I have good hopes it will be faster! I
>>> am very good with these sorts of things, you see :) Speed is my
>>> strong point.
>>>
>>> Currently, I do not know about the speed, because I haven't
>>> implemented a C++ version of my algorithm, or a good speed-testing
>>> framework.
>>>
>>> I do however know that my algorithm is more accurate than BLAST,
>>> because it is just as accurate as SSEARCH, as mine uses the
>>> Smith-Waterman algorithm. Whereas BLAST uses a heuristic, intelligent
>>> guesswork basically. A fine heuristic, but still a heuristic. Mine is
>>> methodological, not heuristic-based.
>>>
>>> So here is what I am looking for!
>>>
>>> I am hoping that someone in the field will be able to offer me
>>> guidance, interest, enthusiasm, suggestions and maybe even do some
>>> testing for me.
>>>
>>> Perhaps a student doing a bio-informatics related degree, who would
>>> like to write a paper on an alternative way of processing protein
>>> databases. My invention could be an interesting subject for a paper.
>>>
>>> Or perhaps a researcher who just has an interest in these sorts of
>>> things! Perhaps a researcher who feels there must be a better way of
>>> doing these things. Or anyone really in this field with the time and
>>> interest, who feels helping me could help him (or her) too in some
>>> way.
>>>
>>> I'd like someone I can ask a lot of questions, show my software to,
>>> and explain what I hope to achieve with it.
>>>
>>> Basically, my first questions to you would be "how would I set this
>>> up to be useful for someone?" and "how would I test its usefulness;
>>> what would you need to know about my algorithm before you would
>>> decide to use it over BLAST?"
>>>
>>> It's sort of a vague question from me, like "what do you need me to
>>> do", but... well, that's where I am right now. Sort of a bit on the
>>> outside, hoping someone on the inside will show me something.
>>>
>>> So it's an opportunity to tell me what you want, basically!! Tell me,
>>> and I might just make it.
>>>
>>> Who knows? Maybe one day in a few years' time, everyone will be using
>>> this "ElfDataFuzzy" algorithm that I invented, instead of BLAST! You
>>> might be part of something.
>>>
>>> Thanks to anyone who replies!
>>>
>>> --
>>> http://elfdata.com/plugin/
>>> "String processing, done right"
>>>
>>> _______________________________________________
>>> BBB mailing list
>>> [email protected]
>>> http://www.bioinformatics.org/mailman/listinfo/bbb

Chris Upton Ph.D.
Associate Professor
Biochemistry and Microbiology    Tel. 250-721-6507
University of Victoria           Fax 250-721-8855
P.O. Box 3055 STN CSC
Victoria, BC V8W 3P6
Canada

web.uvic.ca/~cupton
www.virology.ca
www.biodirectory.com/uptons_blog.html
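
For readers following the thread, the prefix reuse described in the quoted message (the matrix for "ABCDE" matching the start of the matrix for "ABCDEFG") can be illustrated with a small Python sketch. This is not the ElfDataFuzzy implementation, which has not been published in this thread. The function names, the match/mismatch scores, the linear gap cost and the dictionary cache are all assumptions made for illustration. The sketch puts database letters on the rows of the Smith-Waterman matrix, so two database sequences with a common prefix share their first rows.

```python
def next_row(prev_row, db_char, query, match=2, mismatch=-1, gap=-2):
    """Compute one Smith-Waterman row (linear gap cost) for one database letter.

    The match/mismatch/gap values are illustrative assumptions, not a real
    protein scoring matrix.
    """
    row = [0] * (len(query) + 1)
    for j in range(1, len(query) + 1):
        sub = match if db_char == query[j - 1] else mismatch
        row[j] = max(
            0,                      # local alignment: never go below zero
            prev_row[j - 1] + sub,  # align the two letters
            prev_row[j] + gap,      # gap in the query
            row[j - 1] + gap,       # gap in the database sequence
        )
    return row


def best_local_scores(query, database):
    """Best local alignment score per database sequence, reusing shared prefixes.

    Hypothetical helper: rows are cached per database prefix, so a row is
    computed once even if several sequences start with the same letters.
    """
    # prefix -> (last matrix row for that prefix, best score seen so far)
    cache = {"": ([0] * (len(query) + 1), 0)}
    rows_computed = 0
    scores = {}
    for seq in database:
        for i in range(1, len(seq) + 1):
            prefix = seq[:i]
            if prefix not in cache:
                prev_row, prev_best = cache[seq[:i - 1]]
                row = next_row(prev_row, seq[i - 1], query)
                cache[prefix] = (row, max(prev_best, max(row)))
                rows_computed += 1
        scores[seq] = cache[seq][1]
    return scores, rows_computed


if __name__ == "__main__":
    scores, rows = best_local_scores("ABCDEFG", ["ABCDE", "ABCDEFG"])
    print(scores)  # {'ABCDE': 10, 'ABCDEFG': 14}
    print(rows)    # 7 rows computed, instead of 5 + 7 = 12 without sharing
```

A dictionary keyed by prefix strings keeps the sketch short; a trie over the database would be the more natural structure for a large protein set. The sketch shows only the sharing of work between sequences with a common prefix, and says nothing about the speed claims made in the thread.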
