Hi David,

Thanks again for the hints. Great.

I found that dbxflat behaves well, runs fast, and makes small indices when only id and acc are asked for. But GenBank/EMBL have become 500 GB+ monsters uncompressed, so I made this primitive scheme in addition: split the flatfiles into many smaller compressed files, organised in directories named after the first 4 digits of the GI number. Then, with grep and zcat as "accessors" and 5-10 MB chunks, the average access time is 0.1-0.2 seconds. That is much worse than dbxflat, but better than fetching entries from NCBI, and it takes 100 GB instead of 500, close to the distributed compressed size. I would have used EMBL if EBI's remote services worked reliably.

By the way, the seqret documentation doesn't say so, but stdin: works just as stdout: does:

    zcat 2.gz | seqret -filter stdin:AAIY01677200 -sbegin1 11 -osformat2 embl -firstonly

Is adding to existing indices on the todo list for dbxflat?

Niels L

_______________________________________________
EMBOSS mailing list
http://lists.open-bio.org/mailman/listinfo/emboss
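
For illustration, the directory scheme described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the actual scripts used: the function names `gi_dir` and `fetch_gi`, the layout `<root>/<first 4 digits of GI>/<chunk>.gz`, and the reliance on a GenBank-style `GI:<number>` token at the end of the VERSION line with `//` closing each record are all hypothetical choices made for the example.

```shell
#!/bin/sh
# Hypothetical layout (an assumption): $root/<first 4 digits of GI>/<chunk>.gz

# Print the directory name for a GI number: its first 4 digits.
gi_dir() {
    printf '%s\n' "$1" | cut -c1-4
}

# Print the flatfile record whose line ends in "GI:<number>".
# Returns 1 if no chunk in the GI's directory contains it.
fetch_gi() {
    root=$1
    gi=$2
    dir="$root/$(gi_dir "$gi")"
    for chunk in "$dir"/*.gz; do
        [ -f "$chunk" ] || continue
        # Cheap check first: does this chunk mention the GI at all?
        if gzip -dc "$chunk" | grep -q "GI:$gi\$"; then
            # Collect lines per record; a record ends at a line holding only "//".
            gzip -dc "$chunk" | awk -v gi="GI:$gi" '
                { rec = rec $0 "\n" }
                $NF == gi { hit = 1 }
                /^\/\/$/ { if (hit) { printf "%s", rec; exit } rec = "" }
            '
            return 0
        fi
    done
    return 1
}
```

A call such as `fetch_gi /data/genbank 12345678 | seqret -filter stdin -osformat2 embl` would then hand the record to seqret, with `/data/genbank` standing in for wherever the chunks live.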
