[R] Unexpected behaviour in reading genomic coordinate files of R-2.7.0

Margherita Wed, 28 May 2008 03:14:05 -0700

Great R people,

I have noticed a strange behaviour in read.delim() and friends in R 2.7.0. I will describe the problem and the workaround I have already found, both to ask whether this is expected behaviour or whether a correction is needed, and to show people who run into the same difficulty a way to overcome it.


Here is the problem:

I have some genomic coordinate files (BED files, a standard format, for example) containing a column (Strand) in which there is either a "+" or a "-". In R-2.6.2 patched (and every earlier version I have used) I never had problems reading them in, for example:

> a <- read.table("coords.bed", skip=1)
> disp(a)
class  data.frame
dimensions are  38650 6
first rows:
   V1        V2        V3    V4 V5 V6
1 chr1 100088396 100088446  seq1  0  +
2 chr1 100088764 100088814  seq2  0  -

If I run exactly the same command on the same file in R-2.7.0, the result I obtain is:

> a <- read.table("coords.bed", skip=1)
> disp(a)
class  data.frame
dimensions are  38650 6
first rows:
   V1        V2        V3    V4 V5 V6
1 chr1 100088396 100088446  seq1  0  0
2 chr1 100088764 100088814  seq2  0  0

and I completely lose the strand information: the values are all zeros! I have also tried putting quotes around "+" and "-" in the file before reading it, setting stringsAsFactors=FALSE in the read.table() call, and setting "encoding" to a few different alternatives, but the result was always the same: they are all converted to 0.


Then I tried scan() and I saw it was reading the character "+" properly:
> scan("coords.bed",  skip=1, nlines=1, what="ch")
Read 6 items

[1] "chr1"         "100088396"    "100088446.00" "seq1"         "0"
[6] "+"

...my conclusion is that the lone "+" or "-" are not treated as "character" in the data frame creation step. They are treated as "numeric" but, having no digits, are all converted to 0. Is it correct for this to happen even when they are surrounded by quotes?
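The guess that the lone signs are being typed as numeric can be checked directly. The following is only a minimal sketch: it assumes a hypothetical two-row sample file with one leading track line (standing in for coords.bed, which is not attached to this post) and inspects the classes that read.table() infers, plus what type.convert(), the function read.table() uses to guess column types, makes of "+" and "-" on their own. The output will depend on the R version in use.

```r
# Hypothetical sample mirroring the two rows shown in this post;
# the first line stands in for whatever skip=1 skips in the real file.
bed_lines <- c(
  "track name=example",
  "chr1\t100088396\t100088446\tseq1\t0\t+",
  "chr1\t100088764\t100088814\tseq2\t0\t-"
)
tmp <- tempfile(fileext = ".bed")
writeLines(bed_lines, tmp)

# Same call as in the post, without colClasses.
a <- read.table(tmp, skip = 1)

# Class inferred for each column. V6 is the strand column;
# if it is reported as numeric or integer, the signs were lost.
col_classes <- sapply(a, class)
print(col_classes)

# What the type-guessing step does with lone signs.
strand_guess <- type.convert(c("+", "-"), as.is = TRUE)
print(class(strand_guess))

unlink(tmp)
```

If V6 comes back as character or factor, the installed version is not affected; if it comes back numeric, specifying colClasses explicitly avoids the guess altogether.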

Anyway, my temporary solution (which works without changing the files) is:

> a <- read.table("coords.bed", skip=1, colClasses=c("character", "numeric", "numeric", "character", "numeric", "character"))

> a[1:2,]
   V1        V2        V3    V4 V5 V6
1 chr1 100088396 100088446 seq1  0  +
2 chr1 100088764 100088814 seq2  0  -
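For anyone who wants to try the workaround without the original file, here is a self-contained sketch under stated assumptions: the two-row sample file and its leading track line are hypothetical, and the col.names (the usual BED field names) are added only for readability; they are not part of the call above.

```r
# Hypothetical sample mirroring the two rows shown in this post.
bed_lines <- c(
  "track name=example",
  "chr1\t100088396\t100088446\tseq1\t0\t+",
  "chr1\t100088764\t100088814\tseq2\t0\t-"
)
tmp <- tempfile(fileext = ".bed")
writeLines(bed_lines, tmp)

# Declare every column type up front so read.table() does not guess.
bed_classes <- c("character", "numeric", "numeric",
                 "character", "numeric", "character")

a <- read.table(tmp, skip = 1, colClasses = bed_classes,
                col.names = c("chrom", "start", "end",
                              "name", "score", "strand"))

# Fail loudly if anything other than "+" or "-" ends up in the strand column.
stopifnot(all(a$strand %in% c("+", "-")))
print(a)

unlink(tmp)
```

Because the strand column is declared as "character", no type guessing takes place for it, so the result should not depend on how a given R version handles lone signs.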

Another way to avoid losing the strand information was to manually replace "-" with "R" and "+" with "F" in the file before reading it into R. But this is much more cumbersome, since + and - are, for example, the standard notation in BED files accepted and generated by the Genome Browser and other genome sites.

Please let me know what you think. PS: I saw this first in the Fedora version (rpm automatically updated), but it is also reproducible in the Windows version.

Thank you all for your work and for making R the wonderful tool it is!


Cheers,

Margherita

--
-----------------------------------------------------------------------------------

Margherita Mutarelli, PhD
Seconda Universita' di Napoli

Dipartimento di Patologia Generale
via L. De Crecchio, 7
80138 Napoli - Italy
Tel/Fax. +39.081.5665802

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
