Hi all,

Is there a tool in EMBOSS to just count the number of sequences in a file? For simple file formats like FASTA or GenBank I'd typically just use grep:

$ grep -c "^LOCUS " gbvrt1.seq
31065

However, this becomes more complicated for general file formats (e.g. FASTQ files, where in addition to identifiers the quality lines can also start with @) or binary files like BAM, which EMBOSS now supports.

Right now I could handle this by using seqret to convert the file into FASTA and then pipe that through grep to count the records. But an EMBOSS tool would be more elegant, e.g.

$ countseq -sformat=genbank gbvrt1.seq
31065

For the implementation you might offer the choice between using the normal EMBOSS parsing (as in seqret) versus file-format-specific regular expression searches which just look for marker lines (without checking validity), which should be really fast.

Regards,

Peter C.
_______________________________________________
EMBOSS mailing list
http://lists.open-bio.org/mailman/listinfo/emboss
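
To make the marker-line idea above concrete, here is a minimal shell sketch. It is not an EMBOSS tool: `count_records` is a hypothetical helper name, and the FASTQ branch assumes strict four-line records (no wrapped sequence or quality lines), which lets it avoid the problem of quality lines starting with @ by counting lines instead of matching them. Nothing is validated, and binary formats such as BAM are out of scope.

```shell
#!/bin/sh
# Hypothetical helper: count records by marker lines only, with no validation.
# Usage: count_records FORMAT FILE
count_records() {
    fmt=$1
    file=$2
    case "$fmt" in
        fasta)
            # Every FASTA record starts with a ">" header line.
            grep -c '^>' "$file"
            ;;
        genbank)
            # Every GenBank record starts with a "LOCUS " line.
            grep -c '^LOCUS ' "$file"
            ;;
        fastq)
            # Assumption: exactly four lines per record (unwrapped FASTQ).
            # Counting lines sidesteps quality strings that begin with "@".
            awk 'END { print int(NR / 4) }' "$file"
            ;;
        *)
            echo "unsupported format: $fmt" >&2
            return 1
            ;;
    esac
}
```

For example, `count_records genbank gbvrt1.seq` runs the same grep as shown earlier. Files with wrapped FASTQ records would still need a real parser, such as the seqret conversion route.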
