Re: [Bioc-sig-seq] read sequences from the web

Hervé Pagès Tue, 09 Feb 2010 23:29:31 -0800

Hi Thomas,

In Biostrings 2.15.21, read.*StringSet() works again with remote
files:

> aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'

ftp data connection made, file length 770075 bytes
opened URL
==================================================
downloaded 752 Kb

> aaset[1:3]
  A AAStringSet instance of length 3

    width seq                                               names
[1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
[2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
[3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...


Note that I'm using download.file() in the background with quiet=FALSE
(the default) hence the verbose output and progress bar.
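
A minimal sketch of this download-then-read approach in base R, for anyone who wants the same behavior around another file-based reader. The helper name read_via_tempfile() and its reader argument are hypothetical and not part of Biostrings; the actual implementation in the package may differ. The demo uses a file:// URL so it runs without network access.

```r
# Hypothetical helper (not part of Biostrings): if 'filepath' looks like a
# URL, fetch it to a temporary file with download.file(), then hand the
# local path to a reader that only understands local files.
read_via_tempfile <- function(filepath, reader = readLines, quiet = FALSE) {
  is_url <- grepl("^(ftp|https?|file)://", filepath)
  if (!is_url) {
    return(reader(filepath))
  }
  tmp <- tempfile(fileext = ".fasta")
  on.exit(unlink(tmp), add = TRUE)  # remove the temp copy when done
  status <- download.file(filepath, destfile = tmp, quiet = quiet, mode = "wb")
  if (status != 0) {
    stop("download failed for ", filepath)
  }
  # The reader must load everything eagerly, since 'tmp' is deleted on exit.
  reader(tmp)
}

# Offline demo: a small local FASTA file addressed through a file:// URL.
src <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "MTRRSRVG", ">seq2", "MSIIELEG"), src)
src_url <- paste0("file://", normalizePath(src))

res_local <- read_via_tempfile(src)
res_url   <- read_via_tempfile(src_url, quiet = TRUE)
```

With Biostrings attached, a call such as read_via_tempfile(url, reader = read.AAStringSet, quiet = TRUE) should behave like the example above while suppressing the progress output.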

Cheers,
H.


Thomas Girke wrote:

Thanks Hervé. - For me, URL-based sequence imports are useful mainly for
demo purposes. For now, I can certainly work around this limitation by
using stepwise downloads and imports. As usual, speed matters more in
this area than convenience...
Best,
Thomas


On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
Hi Thomas,

Oops, some recent speed improvements to the read.*StringSet() family
that turn out to be regressions for your use case, sorry!

Back in November I re-implemented in C the FASTA parser used by the
read.*StringSet() family to make it faster. Now it's 10x or 20x
faster (I don't remember exactly) to load Human chr1 from a FASTA
file. Because handling R connections in C is not easily doable
right now (the C code in R that handles these connections has not
been designed to be easily reusable in a package), this FASTA parser
uses standard C facilities to read the file, with all the restrictions
that this implies. For example the file must be local, no more URLs,
pipes, fifos, socket connections, etc... all the fancy stuff
supported by R connections (see ?file).
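
To illustrate what the connection layer provides at the R level, here is a small base R sketch (this is not the Biostrings C parser, and count_fasta_records() is a hypothetical name): the same reading code accepts a plain file, a gzip-compressed file, or an in-memory text connection, which is exactly the flexibility a parser built on standard C file I/O gives up.

```r
# Hypothetical helper: count FASTA records from any R connection.
count_fasta_records <- function(con) {
  on.exit(close(con))            # release the connection when done
  lines <- readLines(con)        # works for file(), gzfile(), url(), ...
  sum(grepl("^>", lines))        # each record starts with a '>' header line
}

fasta <- c(">seq1", "MTRRSRVG", ">seq2", "MSIIELEG")

# Plain local file.
plain <- tempfile(fileext = ".fasta")
writeLines(fasta, plain)

# Gzip-compressed copy of the same content.
gz <- tempfile(fileext = ".fasta.gz")
gz_out <- gzfile(gz, "w")
writeLines(fasta, gz_out)
close(gz_out)

n_plain <- count_fasta_records(file(plain))
n_gz    <- count_fasta_records(gzfile(gz))
n_text  <- count_fasta_records(textConnection(fasta))
```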

I underestimated the value of supporting URLs so I'll work on a fix
to at least support those (the fix will consist of downloading
the file first to a temp file, nothing fancy). I'll post again here
when this is ready.

Cheers,
H.


Thomas Girke wrote:
Dear Biostrings Developers,
There seems to be a change (bug?) in the behavior of the read.XXStringSet
functions in the latest Biostrings version when pointing to files on the
web. For instance:
## This works under R-2.10.0
library(Biostrings)
read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta")
## But the same command under R-2.10.1 returns the following error:
Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
  cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
My session info for R-2.10.0 is:
R version 2.10.1 (2009-12-14)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.14.10 IRanges_1.4.9
loaded via a namespace (and not attached):
[1] Biobase_2.6.1


Thanks in advance for your help.

Thomas

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

