Re: [R-sig-phylo] suset DNAbin

Emmanuel Paradis Mon, 05 Sep 2016 12:59:41 -0700

Hi Dan,

You don't need to convert to character to manipulate DNAbin objects: in fact, woodmouse is just a matrix like any other.


R> dim(woodmouse)
[1]  15 965
R> is.matrix(woodmouse)
[1] TRUE
R> dimnames(woodmouse)
[[1]]
 [1] "No305"   "No304"   "No306"   "No0906S" "No0908S" "No0909S" "No0910S"
 [8] "No0912S" "No0913S" "No1103S" "No1007S" "No1114S" "No1202S" "No1206S"
[15] "No1208S"

[[2]]
NULL

R> X <- woodmouse[c('No305', 'No304', 'No306'), ]
R> identical(subsetDNAb, X)
[1] TRUE
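
The same indexing also covers the pattern-based case from the original question (keeping every sequence whose row name contains a given string). The following is a minimal sketch, not a tested recipe for Kirston's data: it uses woodmouse because the original FASTA file is not available, and the pattern "No09" is a hypothetical stand-in for "1011".

```r
library(ape)
data(woodmouse)

# Hypothetical pattern, standing in for "1011" in the original question
pattern <- "No09"

# grep() returns the indices of the row names that contain the pattern ...
idx <- grep(pattern, rownames(woodmouse), fixed = TRUE)

# ... and those indices are then used to index the DNAbin matrix itself
sub <- woodmouse[idx, ]

class(sub)      # still a DNAbin object
rownames(sub)   # only the matching labels

# The subset can be passed straight to dist.dna()
d <- dist.dna(sub)
```

The index vector returned by grep() is only the first half of the job; the sequences appear once it is used inside the brackets. If the object were a DNAbin list rather than a matrix, names(x) and x[idx] would be the analogous calls.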

The (quite crucial) difference is that "DNAbin" requires about 10 times less memory:


R> object.size(woodmouse)
15944 bytes
R> object.size(mouseMat)
117296 bytes

So with (very) big data sets, this makes a difference.
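
A short sketch of where the saving comes from, assuming only the woodmouse data shipped with ape: a DNAbin object keeps each base in a single raw byte, whereas the character matrix keeps one string reference per base. The variable name ratio is introduced here for illustration only, and the exact sizes may differ between R versions and platforms.

```r
library(ape)
data(woodmouse)

mouseMat <- as.character(woodmouse)

# DNAbin stores the bases as raw bytes
typeof(woodmouse)   # "raw"

# The converted matrix stores them as character strings
typeof(mouseMat)    # "character"

# Relative memory footprint of the two representations
ratio <- as.numeric(object.size(mouseMat)) / as.numeric(object.size(woodmouse))
ratio
```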

As a side note, the next version of ape will be able to handle long vectors for DNAbin objects with more than 2.1 billion bases, and read.dna will be able to read files larger than 2.1 Gb.


Best,

Emmanuel

On 05/09/2016 at 20:13, dga...@huskers.unl.edu wrote:

Hi Kirston,


I generally convert DNAbin objects into general R objects like matrices, lists,
and vectors for my subsetting, so I don't have to write DNAbin-specific
functions. I typically use as.character(), which converts a DNAbin object to a
character matrix, then as.DNAbin(), which converts the matrix back to DNAbin.


example:

library(ape)
data(woodmouse)

mouseMat <- as.character(woodmouse)
dim(mouseMat)
[1]  15 965

# then do your normal subsetting
subsetMouse <- mouseMat[c('No305', 'No304', 'No306'), ]
dim(subsetMouse)
[1]   3 965

# then convert it back to DNAbin
subsetDNAb <- as.DNAbin(subsetMouse)
subsetDNAb
3 DNA sequences in binary format stored in a matrix.

All sequences of same length: 965

Labels: No305 No304 No306

Base composition:
    a     c     g     t
0.306 0.260 0.126 0.307



Cheers

-Dan




________________________________
From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of Kirston Barton 
<kirston.bar...@sydney.edu.au>
Sent: Monday, September 5, 2016 1:07:25 AM
To: r-sig-phylo@r-project.org
Subject: [R-sig-phylo] suset DNAbin

Hi,

I have my data in a FASTA file and am importing it into R using read.dna, which
creates a DNAbin matrix object. I would like to subset it by sequence name so
that I can compute the pairwise nucleotide distances using dist.dna. I have
attempted to do this using grep, but all I get is the index numbers of the
sequences with matching names, not the sequences or their names. Does anyone
have a suggestion for an easy way to do this?

For example, my DNAbin object has the following row names:

[1] "01011-DNA1.Contig1"    "01011-DNA11.Contig1"   "01011-DNA12.Contig1"   
"01011-DNA13.Contig1"   "01011-DNA14.Contig1"
  [6] "01011-DNA16.Contig1"   "01011-DNA17.Contig1"   "01011-DNA18.Contig1"   
"01011-DNA19.Contig1"   "01011-DNA2.Contig1"
 [11] "01011-DNA20.Contig1"   "01011-DNA21.Contig1"   "01011-DNA22.Contig1"   
"01011-DNA23.Contig1"   "01011-DNA24.Contig1"
 [16] "01011-DNA25.Contig1"   "01011-DNA26.Contig1"   "0103-PRNA2.Contig1"    
"01011-DNA3.Contig1"    "01011-DNA33.Contig1"
 [21] "01011-DNA4.Contig1"    "01011-DNA5.Contig1"    "01011-DNA6.Contig1"    
"01011-DNA7.Contig1"    "01011-DNA8.Contig1"
 [26] "01011-DNA9.Contig1"    "01011-RNA10.Contig1"   "01011-RNA13.Contig1"   
"01011-RNA14.Contig1"   "01011-RNA17.Contig1"
 [31] "01011-RNA18.Contig1"   "01011-RNA19.Contig1"   "01011-RNA21.Contig1"   
"01011-RNA23.Contig1"   "01011-RNA24.Contig1"
 [36] "01011-RNA26.Contig1"   "01011-RNA28.Contig1"   "01011-RNA29.Contig1"   
"01011-RNA30.Contig1"   "01011-RNA31.Contig1"
 [41] "01011-RNA32.Contig1"   "01011-RNA33.Contig1"   "01011-RNA35.Contig1"   
"01011-RNA38.Contig1"   "01011-RNA4.Contig1"
 [46] "01011-RNA5.Contig1"    "01011-RNA6.Contig1"    "01011-RNA8.Contig1"    
"01011-RNA9.Contig1"    "0102A-CRNA103.Contig1"
 [51] "0102A-CRNA105.Contig1" "0102A-CRNA110.Contig1" "0102A-CRNA113.Contig1" 
"0102A-CRNA115.Contig1" "0102A-CRNA118.Contig1"
 [56] "0102a-DNA10.Contig1"


I would like a new DNAbin object with sequences that have 1011 anywhere in 
their row name.

Please let me know if I have left out any pertinent information. Thank you in
advance for any suggestions or help with this matter.

Kind regards,
Kirston
_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
