Re: [Bioc-sig-seq] read sequences from the web

Michael Lawrence Fri, 05 Feb 2010 11:01:17 -0800

The state of the R connection framework is lamentable, but maybe we do not
need to rely on R for this. There are plenty of other libraries out there
that provide abstract connections with multiple backends. Some of them are
pretty heavy, like GIO and Qt, but the UCSC library also has some nice
routines, especially for access over HTTP. If we ever put together a common
I/O package, I think it would want to provide abstract connections, somehow.
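
In the meantime, the temp-file workaround Hervé describes below can be
approximated from user code. This is only a rough sketch under assumptions:
read_via_tempfile is a made-up helper name, not a Biostrings function, and it
assumes download.file() can reach the URL from your machine.

```r
# Hypothetical helper (not part of Biostrings): fetch a remote file to a
# temporary local path, hand that path to a reader function, then clean up.
read_via_tempfile <- function(url, reader, ...) {
  tmp <- tempfile(fileext = ".fasta")
  on.exit(unlink(tmp), add = TRUE)  # remove the temp file when we return
  # mode = "wb" avoids line-ending translation on Windows
  status <- download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  reader(tmp, ...)
}

# Usage with the call from Thomas's report (needs Biostrings and network):
# library(Biostrings)
# aa <- read_via_tempfile(
#   "ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa",
#   read.AAStringSet, "fasta")
```

The reader only ever sees a local path, so this sidesteps the connection
issue rather than solving it.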


Michael

2010/2/5 Laurent Gautier <[email protected]>

> On 2/5/10 6:43 PM, Hervé Pagès wrote:
>
>> Hi Thomas,
>>
>> Oops, some recent speed improvements to the read.*StringSet() family
>> turn out to be a regression for your use case, sorry!
>>
>> Back in November I re-implemented in C the FASTA parser used by the
>> read.*StringSet() family to make it faster. Now it's 10x or 20x
>> faster (I don't remember exactly) to load Human chr1 from a FASTA
>> file. Because handling R connections in C is not easily doable
>> right now (the C code in R that handles these connections has not
>> been designed to be easily reusable in a package),
>>
>
> This surfaces occasionally on the R-devel mailing list, and someone even
> contributed a patch. It all seems to have been largely ignored, maybe
> because a critical mass has not been reached (I am still trying to
> rationalize ;-) ). Maybe you'll have a strategy to get it pushed through.
>
>
>> this FASTA parser
>> uses standard C facilities to read the file, with all the restrictions
>> that this implies. For example the file must be local, no more URLs,
>> pipes, fifos, socket connections, etc... all the fancy stuff
>> supported by R connections (see ?file).
>>
>> I underestimated the value of supporting URLs so I'll work on a fix
>> to at least support those (the fix will consist of downloading
>> the file first to a temp file, nothing fancy). I'll post again here
>> when this is ready.
>>
>> Cheers,
>> H.
>>
>>
>> Thomas Girke wrote:
>>
>>> Dear Biostrings Developers,
>>>
>>> There seems to be a change (bug?) in the behavior of the
>>> read.XXStringSet functions
>>> in the latest Biostrings version when pointing to files on the web.
>>> For instance:
>>> ## This works under R-2.10.0
>>> library(Biostrings)
>>> read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa",
>>>                  "fasta")
>>> ## But the same command under R-2.10.1 returns the following error:
>>> Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
>>>   cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
>>>
>>>
>>> My session info for R-2.10.1 is:
>>>
>>> R version 2.10.1 (2009-12-14) x86_64-unknown-linux-gnu
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>> other attached packages:
>>> [1] Biostrings_2.14.10 IRanges_1.4.9
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1
>>>
>>>
>>> Thanks in advance for your help.
>>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>>


