>>>>> "Jeffrey" == Jeffrey Rosenfeld <[EMAIL PROTECTED]> writes:
Jeffrey> I am new to this list, so my question might have already
Jeffrey> been discussed, but I cannot find any reference to it in
Jeffrey> the archive, so here goes: I am trying to find a quick
Jeffrey> java-only way to retrieve sequences from a blast
Jeffrey> database. I am writing a program that needs to obtain
Jeffrey> large amounts of sequences from a fairly large database.
Jeffrey> I have tried using fastacmd, but there is a great
Jeffrey> slowdown because of the need to start up an external
Jeffrey> process for each sequence query. (I cannot execute one
Jeffrey> large fastacmd job because of the large amounts of
Jeffrey> sequence that I am querying.) I know that biojava has
Jeffrey> many different formats for storing sequences, but I don't
Jeffrey> want to have to keep two databases of my sequences
Jeffrey> updated. I am already using the blast database for
Jeffrey> blast, so I don't want another database. Is there a
Jeffrey> simple way to implement fastacmd or something similar in
Jeffrey> java? It should not be too hard to do either using JNI
Jeffrey> or reverse engineering the fastacmd code.

Hi Jeffrey,

This is possible, but you would at least need to make a new
(additional) index of the Blast database. Biojava does not have a
reader for Blast indices because their format differs between the
NCBI and WU flavours and is also apt to change.

Brief background on the available indices: we started with our own
system (see the interfaces org.biojava.bio.seq.db.Index and
org.biojava.bio.seq.db.IndexStore, and the TabIndexStore
implementation of IndexStore).

Later an indexing system common to all the Bio* projects was proposed
and implemented (i.e. you can index with Bioperl and read in Biopython
etc). See the obf-common cvs package for a full spec and other docs
via webcvs at http://cvs.open-bio.org.
This common indexing system is quite heavily integrated with a
system-wide registry for local and distributed databases (also
described in the obf-common docs), which you won't need to worry about
as you just want a simple lookup.

To use this system, there is an end-user indexing program,
org.biojava.app.BioFlatIndex, which can create the index (actually a
directory containing metadata and offsets into sequence files).
Alternatively you can index programmatically using the
org.biojava.bio.program.indexdb.IndexTools class. See the unit tests
(in cvs, org.biojava.bio.program.indexdb.IndexToolsTest) for examples
such as:

    public void testIndexFastaDNA()
        throws Exception
    {
        File [] files = getDBFiles(new String [] { "dna1.fasta",
                                                   "dna2.fasta" });

        IndexTools.indexFasta("test", new File(location), files,
                              SeqIOConstants.DNA);

        SequenceDBLite db = new FlatSequenceDB(location, "dna");

        Sequence seq1 = db.getSequence("id1");
        assertEquals("gatatcgatt", seq1.seqString());
        Sequence seq2 = db.getSequence("id2");
        assertEquals("ggcgcgcgcg", seq2.seqString());
        Sequence seq3 = db.getSequence("id3");
        assertEquals("ccccccccta", seq3.seqString());
        Sequence seq4 = db.getSequence("id4");
        assertEquals("tttttcgatt", seq4.seqString());
        Sequence seq5 = db.getSequence("id5");
        assertEquals("ggttcgcgcg", seq5.seqString());
        Sequence seq6 = db.getSequence("id6");
        assertEquals("nnnnnnttna", seq6.seqString());
    }

Finally, the binary indices created by the Staden package and EMBOSS
(Embl CDROM format) are also supported. If you index your flatfiles
with dbifasta/dbiblast you can read the EMBOSS indices from Biojava
with a little effort. This uses an EmblCDROM implementation of our old
IndexStore interface.
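
For illustration only: the flat-file index above is described as a set
of offsets into the sequence files. A minimal pure-JDK sketch of that
general idea follows. It is not how Biojava, the OBF spec or EMBOSS
actually lay out their indices, and the class FastaOffsetIndex and its
methods are hypothetical names invented for this example. It scans a
FASTA file once, records the byte offset of each header line, and then
seeks straight to a record on lookup, all in-process, so no external
fastacmd start-up is paid per query.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of Biojava or the OBF index spec. */
final class FastaOffsetIndex {
    private final Path fasta;
    // Maps a sequence ID to the byte offset of its '>' header line.
    private final Map<String, Long> offsets = new HashMap<>();

    /** Scans the file once and records where each record starts. */
    FastaOffsetIndex(Path fasta) {
        this.fasta = fasta;
        try (RandomAccessFile raf = new RandomAccessFile(fasta.toFile(), "r")) {
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Assumption: the ID is the first whitespace-delimited
                    // token after '>' on the header line.
                    String id = line.substring(1).trim().split("\\s+")[0];
                    offsets.put(id, pos);
                }
                pos = raf.getFilePointer(); // start of the next line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Seeks to the recorded offset and reads residues up to the next header. */
    String getSequence(String id) {
        Long offset = offsets.get(id);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown id: " + id);
        }
        try (RandomAccessFile raf = new RandomAccessFile(fasta.toFile(), "r")) {
            raf.seek(offset);
            raf.readLine(); // skip the header line itself
            StringBuilder residues = new StringBuilder();
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                residues.append(line.trim());
            }
            return residues.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

RandomAccessFile.readLine reads a byte at a time, so anything beyond a
sketch would buffer its reads and persist the offsets to disk rather
than rebuild them on start-up. The point is only that each lookup
costs a seek rather than a process launch.
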
The unit tests (org.biojava.bio.seq.db.EmblCDROMIndexStoreTest) should
prove useful:

    URL divURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/division.lkp");
    URL entURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/entrynam.idx");

    File divisionLkp = new File(divURL.getFile());
    File entryNamIdx = new File(entURL.getFile());

    // format, alpha, parser and factory are declared elsewhere in the test
    format  = new FastaFormat();
    alpha   = ProteinTools.getAlphabet();
    parser  = alpha.getTokenization("token");
    factory =
        new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);

    EmblCDROMIndexStore emblCDIndexStore =
        new EmblCDROMIndexStore(divisionLkp,
                                entryNamIdx,
                                format,
                                factory,
                                parser);

    emblCDIndexStore.setPathPrefix(entryNamIdx.getParentFile().getAbsoluteFile());
    SequenceDB sequenceDB = new IndexedSequenceDB(emblCDIndexStore);

and later...

    // Test actual sequence fetches
    Sequence seq = sequenceDB.getSequence("NMA0007");
    assertEquals("NMA0007", seq.getName());
    assertEquals(235, seq.length());

    seq = sequenceDB.getSequence("NMA0020");
    assertEquals("NMA0020", seq.getName());
    assertEquals(494, seq.length());

    seq = sequenceDB.getSequence("NMA0030");
    assertEquals("NMA0030", seq.getName());
    assertEquals(245, seq.length());

Hope this is useful,

Keith

--

- Keith James <[EMAIL PROTECTED]> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -
_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l