Hi David,

Thanks again for the hints, they were a great help.

I found that dbxflat behaves well, runs fast and makes small indices
when only the id and acc fields are requested. But GenBank/EMBL have
become 500 GB+ monsters uncompressed, so I made this primitive scheme
in addition: split the flatfiles into many smaller compressed files,
organised in directories named after the first 4 digits of the GI
number. Then, with grep and zcat as "accessors" and 5-10 MB chunks,
the average access time is 0.1-0.2 seconds. That is much worse than
dbxflat, but better than fetching entries from NCBI, and the whole
thing takes 100 GB instead of 500, close to the size of the compressed
distribution. I would have used EMBL if EBI's remote services worked
reliably. Btw, the seqret documentation doesn't say so, but stdin:
works the same way as stdout:

zcat 2.gz | seqret -filter stdin:AAIY01677200 -sbegin1 11 -osformat2 embl -firstonly
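
For anyone who wants to try the chunked layout, here is a minimal
sketch of the lookup step. It is only an illustration under stated
assumptions: the layout ROOT/<first 4 digits of GI>/<chunk>.gz, the
function names gi_dir and fetch_by_gi, and the GenBank-style
"VERSION ... GI:<number>" line are all made up for the example, and it
uses awk instead of grep so that the whole entry comes out in one pass.

```shell
# Hypothetical layout (an assumption for this sketch):
#   ROOT/<first 4 digits of GI>/<chunk>.gz
# where every chunk holds complete flatfile entries terminated by "//".

# Directory name for a GI number: its first four digits.
gi_dir() {
    printf '%.4s\n' "$1"
}

# Print the entry with the given GI number.
#   $1 = root directory, $2 = GI number
# Returns 0 if the entry was found, 1 otherwise.
fetch_by_gi() {
    for chunk in "$1/$(gi_dir "$2")"/*.gz; do
        # Skip the unexpanded pattern when the directory has no chunks.
        [ -e "$chunk" ] || continue
        # gzip -cd is the portable spelling of zcat.
        gzip -cd "$chunk" | awk -v want="GI:$2" '
            { buf = buf $0 "\n" }                      # collect the current entry
            $1 == "VERSION" && $NF == want { hit = 1 } # this entry is the one
            $0 == "//" {                               # end of an entry
                if (hit) { printf "%s", buf; found = 1; exit }
                buf = ""
            }
            END { exit (found ? 0 : 1) }
        ' && return 0
    done
    return 1
}
```

The output can be piped straight into seqret through stdin:, for
example: fetch_by_gi /data/genbank 123456789 | seqret -filter stdin
-osformat2 embl (the path and GI number here are placeholders).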

Is adding entries to existing indices on the to-do list for dbxflat?

Niels L


_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
