Good day,

If I have a FASTA file that contains

>sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens 
>GN=TAS2R10 PE=1 SV=3
MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
RKNLRVT

readFasta fails to import it with the warning

proteins <- readFasta('.', "test.fasta")

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
  reading FASTA file test.fasta: ignored 129 invalid one-letter sequence codes

Also, the amino acid sequence is incomplete. There are 308 amino acids, but 

> width(proteins)
[1] 178

It's undesirable for users that some amino acids are discarded. Hopefully, they 
notice the warning message before proceeding with the analysis.

Admittedly, readFasta is in ShortRead, so is designed to work with high 
througput sequencing reads. But, perhaps it would be better suited to a 
infrastructure package such as Biobase and generalised to correctly import any 
FASTA file. There's even a Bioconductor workflow at 
https://www.bioconductor.org/help/workflows/sequencing/ which has a section 
titled "DNA/amino acid sequence from FASTA files" and demonstrates the use of 
readFasta.

I used version 1.34.2 of ShortRead which is the newest one.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to