Hi Thomas,

In Biostrings 2.15.21, read.*StringSet() works again with remote
files:

> aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
ftp data connection made, file length 770075 bytes
opened URL
==================================================
downloaded 752 Kb

> aaset[1:3]
  A AAStringSet instance of length 3
    width seq                                               names
[1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
[2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
[3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...

Note that I'm using download.file() in the background with quiet=FALSE
(the default) hence the verbose output and progress bar.

Cheers,
H.


Thomas Girke wrote:
Thanks Hervé. For me, URL-based sequence imports are useful mainly for demo purposes. For now, I can certainly work around this limitation by using stepwise downloads and imports. As usual, speed matters more in this area than convenience...

Best,
Thomas


On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
Hi Thomas,

Oops, some recent speed improvements to the read.*StringSet() family
turn out to be regressions for your use case, sorry!

Back in November I re-implemented in C the FASTA parser used by the
read.*StringSet() family to make it faster. Now it's 10x or 20x
faster (I don't remember exactly) to load Human chr1 from a FASTA
file. Because handling R connections in C is not easily doable
right now (the C code in R that handles these connections has not
been designed to be easily reusable in a package), this FASTA parser
uses standard C facilities to read the file, with all the restrictions
that this implies. For example the file must be local, no more URLs,
pipes, fifos, socket connections, etc... all the fancy stuff
supported by R connections (see ?file).

I underestimated the value of supporting URLs so I'll work on a fix
to at least support those (the fix will consist of downloading
the file first to a temp file, nothing fancy). I'll post again here
when this is ready.
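
As a rough illustration of that download-first approach (a sketch only,
not the actual Biostrings code): the helper name read_via_tempfile() and
its arguments are made up for this example, and the URL test is a simple
prefix check. It fetches a remote file to a temp file with
download.file(), then passes the local path to whatever file-based
reader is supplied.

```r
# Hypothetical helper, not part of Biostrings: fetch a remote file to a
# temporary local copy, then hand the local path to a file-based reader.
read_via_tempfile <- function(filepath, reader, ..., quiet = FALSE) {
  # Assumption: anything starting with one of these schemes is a URL.
  is_url <- grepl("^(ftp|https?|file)://", filepath)
  if (is_url) {
    local_path <- tempfile()
    # Remove the temporary copy when the function exits, even on error.
    on.exit(unlink(local_path), add = TRUE)
    # download.file() returns 0 on success; quiet = FALSE shows progress.
    status <- download.file(filepath, destfile = local_path,
                            quiet = quiet, mode = "wb")
    if (status != 0L)
      stop("download failed for '", filepath, "'")
    filepath <- local_path
  }
  # Local paths are passed through untouched.
  reader(filepath, ...)
}
```

With such a helper, something like
read_via_tempfile(url, read.AAStringSet, "fasta") would keep the fast
local-file parser while still accepting a URL; passing quiet=TRUE would
silence the download messages.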

Cheers,
H.


Thomas Girke wrote:
Dear Biostrings Developers,

There seems to be a change (bug?) in the behavior of the read.XXStringSet functions in the latest Biostrings version when pointing to files on the web. For instance:
## This works under R-2.10.0
library(Biostrings)
read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta")
## But the same command under R-2.10.1 returns the following error:
Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) : cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'

My session info for R-2.10.0 is:

R version 2.10.1 (2009-12-14)
x86_64-unknown-linux-gnu

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.14.10 IRanges_1.4.9
loaded via a namespace (and not attached):
[1] Biobase_2.6.1


Thanks in advance for your help.

Thomas

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

